
Run Llama With PyTorch on Arm-Based Infrastructure

What you’ll build:

  • You’ll create a browser-based large language model (LLM) application that runs Llama 3.1 quantized to INT4 entirely on an Arm-based AWS Graviton CPU, with a Streamlit frontend and a torchchat backend.

What you’ll learn:

  • Download the Meta Llama 3.1 model from the Meta Hugging Face repository (see the download sketch after this list).
  • Quantize the model to 4-bit using the optimized INT4 KleidiAI kernels for PyTorch (the underlying arithmetic is sketched below).
  • Run LLM inference using PyTorch on an Arm-based CPU.
  • Expose the LLM inference as a browser application, with Streamlit as the frontend and the torchchat framework in PyTorch as the LLM backend server (a minimal frontend is sketched below).
  • Measure performance metrics of the LLM inference running on an Arm-based CPU.

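The learning path drives the model download through torchchat's own download command; purely as an illustration of the same step in plain Python, the sketch below uses the huggingface_hub client instead. The repo ID shown (the 8B instruct variant) and the gated-access token are assumptions, since access to Meta's Llama repositories on Hugging Face must be requested and approved first.

```python
# Sketch: fetch Llama 3.1 weights from the Meta Hugging Face repository.
# Assumptions: you have been granted access to the gated repo, and the
# repo ID below (the 8B instruct variant) is the one you want.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    token="hf_...",  # replace with your Hugging Face access token
)
print(f"Model files downloaded to {local_dir}")
```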
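The 4-bit quantization itself is performed by torchchat using the KleidiAI-optimized INT4 kernels; the exact command and config live in the learning path. To make the step concrete, here is a minimal PyTorch sketch of what symmetric group-wise INT4 weight quantization computes: each group of weights shares one scale, and values are rounded into the signed 4-bit range [-8, 7]. This illustrates the arithmetic only, not the KleidiAI kernels.

```python
# Illustration of symmetric group-wise INT4 weight quantization.
# The learning path does this via torchchat's KleidiAI-backed kernels;
# this sketch only shows the underlying arithmetic.
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 32):
    """Quantize a 2-D weight tensor to signed INT4, one scale per group."""
    out_features, in_features = w.shape
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # Pick each group's scale so its largest magnitude maps to 7.
    scales = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    """Map INT4 codes back to float for comparison against the original."""
    return (q.float() * scales).reshape(q.shape[0], -1)

w = torch.randn(16, 64)
q, scales = quantize_int4_groupwise(w)
print("max abs error:", (w - dequantize(q, scales)).abs().max().item())
```

Group-wise scales are what keep 4-bit weights usable: a single per-tensor scale would let one outlier weight crush the resolution of all the others.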
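For the browser piece, torchchat ships both a REST server and a chat UI, and the learning path wires Streamlit to that backend. The sketch below is a minimal stand-in for such a frontend, assuming the torchchat server is running locally and exposing its OpenAI-compatible chat completions endpoint on port 5000 (check your torchchat version's README for the exact URL and path). It also times each request to report a rough tokens-per-second figure, the kind of metric the last bullet refers to.

```python
# Minimal Streamlit frontend for a locally running torchchat server.
# Assumption: `python3 torchchat.py server llama3.1` (or equivalent) is
# serving an OpenAI-compatible endpoint at the URL below.
import time

import requests
import streamlit as st

ENDPOINT = "http://127.0.0.1:5000/v1/chat/completions"  # assumed URL/port

st.title("Llama 3.1 on an Arm-based CPU")
prompt = st.text_area("Prompt", "Why is the sky blue?")

if st.button("Generate"):
    start = time.perf_counter()
    resp = requests.post(
        ENDPOINT,
        json={
            "model": "llama3.1",
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=300,
    )
    elapsed = time.perf_counter() - start
    body = resp.json()
    st.write(body["choices"][0]["message"]["content"])
    # Rough throughput: completion tokens over wall-clock time. The
    # `usage` field is part of the OpenAI response schema; fall back
    # gracefully if the server omits it.
    tokens = body.get("usage", {}).get("completion_tokens")
    if tokens:
        st.caption(f"{tokens} tokens in {elapsed:.1f} s "
                   f"({tokens / elapsed:.1f} tokens/s)")
```

Save this as app.py and launch it with `streamlit run app.py`; tokens per second on the Graviton CPU is the headline number to watch.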
Watch the on-demand session below, or start building with the “Run a Large Language Model chatbot with PyTorch” learning path and follow the same workflow at your own pace.



[On-demand session video]
