
Run Llama With PyTorch on Arm-Based Infrastructure

What you’ll build:

  • You’ll create a browser-based large language model (LLM) application that runs Llama 3.1 quantized to INT4 entirely on an Arm-based AWS Graviton CPU, with a Streamlit frontend and a torchchat backend.

What you’ll learn:

  • Download the Meta Llama 3.1 model from the Meta Hugging Face repository (see the download sketch after this list).
  • Quantize the model to 4-bit using the optimized INT4 KleidiAI kernels for PyTorch (the underlying arithmetic is sketched below).
  • Run LLM inference using PyTorch on an Arm-based CPU.
  • Expose the LLM inference as a browser application, with Streamlit as the frontend and the torchchat framework in PyTorch as the LLM backend server (a minimal frontend is sketched below).
  • Measure performance metrics of the LLM inference running on an Arm-based CPU.

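The learning path drives the model download through torchchat's own download command; purely as an illustration of the same step in plain Python, the sketch below uses the huggingface_hub client instead. The repo ID shown (the 8B instruct variant) and the gated-access token are assumptions, since access to Meta's Llama repositories on Hugging Face must be requested and approved first.

```python
# Sketch: fetch Llama 3.1 weights from the Meta Hugging Face repository.
# Assumptions: you have been granted access to the gated repo, and the
# repo ID below (the 8B instruct variant) is the one you want.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    token="hf_...",  # replace with your Hugging Face access token
)
print(f"Model files downloaded to {local_dir}")
```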
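The 4-bit quantization itself is performed by torchchat using the KleidiAI-optimized INT4 kernels; the exact command and config live in the learning path. To make the step concrete, here is a minimal PyTorch sketch of what symmetric group-wise INT4 weight quantization computes: each group of weights shares one scale, and values are rounded into the signed 4-bit range [-8, 7]. This illustrates the arithmetic only, not the KleidiAI kernels.

```python
# Illustration of symmetric group-wise INT4 weight quantization.
# The learning path does this via torchchat's KleidiAI-backed kernels;
# this sketch only shows the underlying arithmetic.
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 32):
    """Quantize a 2-D weight tensor to signed INT4, one scale per group."""
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # Pick each group's scale so its largest magnitude maps to 7.
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Map INT4 codes back to float for comparison against the original."""
    return (q.float() * scales).reshape(q.shape[0], -1)

w = torch.randn(16, 64)
q, scales = quantize_int4_groupwise(w)
print("max abs error:", (w - dequantize(q, scales)).abs().max().item())
```

Group-wise scales are what keep 4-bit weights usable: a single per-tensor scale would let one outlier weight crush the resolution of all the others.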
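For the browser piece, torchchat ships both a REST server and a chat UI, and the learning path wires Streamlit to that backend. The sketch below is a minimal stand-in for such a frontend, assuming the torchchat server is running locally and exposing its OpenAI-compatible chat completions endpoint on port 5000 (check your torchchat version's README for the exact URL and path). It also times each request to report a rough tokens-per-second figure, the kind of metric the last bullet refers to.

```python
# Minimal Streamlit frontend for a locally running torchchat server.
# Assumption: `python3 torchchat.py server llama3.1` (or equivalent) is
# serving an OpenAI-compatible endpoint at the URL below.
import time

import requests
import streamlit as st

ENDPOINT = "http://127.0.0.1:5000/v1/chat/completions"  # assumed URL/port

st.title("Llama 3.1 on an Arm-based CPU")
prompt = st.text_area("Prompt", "Why is the sky blue?")

if st.button("Generate"):
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={
            "model": "llama3.1",
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=300,
    )
    elapsed = time.perf_counter() - start
    body = resp.json()
    st.write(body["choices"][0]["message"]["content"])
    # Rough throughput: completion tokens over wall-clock time. The
    # `usage` field is part of the OpenAI response schema; fall back
    # gracefully if the server omits it.
    tokens = body.get("usage", {}).get("completion_tokens")
    if tokens:
        st.caption(f"{tokens} tokens in {elapsed:.1f} s "
                   f"({tokens / elapsed:.1f} tokens/s)")
```

Save this as app.py and launch it with `streamlit run app.py`; tokens per second on the Graviton CPU is the headline number to watch.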
Watch the on-demand session below, or start building with the “Run a Large Language Model chatbot with PyTorch” learning path and follow the same workflow at your own pace.



[On-demand session video]
