Run Llama With PyTorch on Arm-Based Infrastructure

What you’ll build:


  • You’ll create a browser-based large language model (LLM) application that serves Llama 3.1 quantized to INT4, with a Streamlit frontend and a torchchat backend, running entirely on an Arm-based AWS Graviton CPU.

What you’ll learn:


  • Download the Meta Llama 3.1 model from the Meta Hugging Face repository.
  • Quantize the model to 4 bits using optimized INT4 KleidiAI kernels for PyTorch.
  • Run LLM inference using PyTorch on an Arm-based CPU.
  • Expose the LLM inference as a browser application, with Streamlit as the frontend and the torchchat framework in PyTorch as the backend LLM server.
  • Measure performance metrics of the LLM inference running on an Arm-based CPU.
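As a rough illustration of the last step, the usual inference metrics, time to first token and tokens per second, can be derived from simple wall-clock timestamps taken around the generation loop. The sketch below is a minimal example; the timing values are hypothetical placeholders, not measured Graviton numbers:

```python
def summarize_run(start: float, first_token_at: float, end: float,
                  n_tokens: int) -> dict:
    """Compute common LLM inference metrics from wall-clock timestamps (seconds)."""
    ttft = first_token_at - start        # time to first token
    decode_time = end - first_token_at   # time spent generating the remaining tokens
    tokens_per_s = n_tokens / (end - start)  # overall generation throughput
    return {"ttft_s": ttft, "decode_s": decode_time, "tokens_per_s": tokens_per_s}

# Hypothetical timings for a 128-token completion (illustrative values only):
metrics = summarize_run(start=0.0, first_token_at=0.25, end=8.25, n_tokens=128)
print(metrics)
```

In a real run you would record the timestamps with `time.perf_counter()` immediately before the request, on receipt of the first streamed token, and after the final token.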

Watch the on-demand session below, or start building with the Run a Large Language Model chatbot with PyTorch learning path and follow the same workflow at your own pace.
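Once the torchchat backend is serving the model, the Streamlit frontend only needs to send it chat requests over HTTP. One way to sketch that, assuming the server exposes an OpenAI-style chat-completions endpoint (the URL, endpoint path, and model name below are illustrative assumptions, not confirmed torchchat defaults):

```python
import json
import urllib.request

def build_chat_request(prompt: str,
                       base_url: str = "http://localhost:8000",  # assumed local server
                       model: str = "llama3.1") -> urllib.request.Request:
    """Build an OpenAI-style chat-completions request for a local LLM server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",  # assumed endpoint path
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Hello from Graviton!")
print(req.full_url)
```

The Streamlit app would call `urllib.request.urlopen(req)` (or an HTTP client of your choice) and render the returned completion in the chat UI.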


