Servers and Cloud Computing blog
August 18, 2025

Distributed Generative AI Inference on Arm
By Waheed Brown

Reading time 5 minutes

Generative AI is becoming more efficient, and large language models (LLMs) are shrinking in size. This creates new opportunities to run LLMs on more efficient hardware. For example, cloud services can now run AI inference on Arm-based CPUs.

What is distributed AI inference?

AI inference happens when an AI application processes a user’s prompt. With LLMs, users commonly enter text into a chatbot's prompt window, press "send", and then this request is processed on an AI server. AI inference does not need to run on a single machine or virtual machine (VM). To scale inference beyond a single machine's memory (RAM or VRAM), LLM weights and computations are often distributed across many machines.

How are LLM Weights and Computations Distributed?

LLM weights and computations are often distributed using a client-server model. One machine is appointed as the main node (client). The remaining machines function as worker nodes (servers). Each worker loads a shard of the model and participates in parallel computation.

Using an AI framework like llama.cpp, LLM weights and computations can be distributed across machines using RPC:

# A llama.cpp Worker node listens for inference requests from the Main node.
rpc-server -p 50052 -H 0.0.0.0 -t 64
# The llama.cpp Main node initiates distributed LLM inference
llama-cli -m model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99

Explanation of rpc-server parameters:

  • -p 50052 The listening TCP port on the Worker.
  • -H 0.0.0.0 The host address the Worker binds to; 0.0.0.0 listens on all network interfaces, so the Main node can connect from any address.
  • -t 64 CPU thread count.

Explanation of llama-cli parameters:

  • -m model.gguf Specifies the quantized LLaMA model file (GGUF format) to load.
  • -p "Tell me a joke" The prompt passed to the model for generation.
  • -n 128 Maximum number of tokens to generate in the output.
  • --rpc "$worker_ips" A comma-separated list of worker addresses (host:port pairs).
  • -ngl 99 The number of the LLM's neural network layers to offload to the GPU.
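
The `--rpc` flag expects `$worker_ips` to hold a comma-separated list of `host:port` pairs, one per worker. A minimal sketch, using hypothetical worker addresses (adjust to your environment):

```shell
# Hypothetical worker addresses; each entry is host:port,
# where the port matches the -p value passed to rpc-server.
worker_ips="10.0.0.2:50052,10.0.0.3:50052"

# The Main node passes the list to llama-cli (echoed here for illustration):
echo llama-cli -m model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99
```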

On CPU-only Arm cloud machines, distributed inference runs entirely on CPUs. This makes a worker node's thread count, -t 64, the key parameter. Set the number of threads to match the number of CPU cores on each worker node. Note that on CPU-only Arm machines the GPU delegation parameter, -ngl 99, is ignored.
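
Rather than hard-coding the thread count, each worker can derive it from its own hardware. A sketch using the standard `nproc` utility (the rpc-server invocation is echoed for illustration):

```shell
# Determine the number of CPU cores on this worker.
threads=$(nproc)

# Launch the llama.cpp worker with one thread per core:
echo rpc-server -p 50052 -H 0.0.0.0 -t "$threads"
```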

What Arm Cloud Machines are Available for Distributed Inference?

All three major cloud providers have Arm machines that are suitable for distributed inference:

Amazon Web Services
  • AWS Graviton CPU VMs: Graviton 2, Graviton 3, and Graviton 4 VMs
  • NVIDIA Grace Arm CPU VMs: P6e (NVIDIA GB200 GPUs)

Google Cloud
  • Google Axion CPU VMs
  • NVIDIA Grace Arm CPU VMs: A4x (NVIDIA GB200 GPUs)

Microsoft Azure
  • Azure Cobalt 100 CPU VMs
  • NVIDIA Grace Arm CPU VMs

What is Next?

To learn more about inference using llama.cpp on Arm, visit our Arm Learning Path.
Re-use is only permitted for informational and non-commercial or personal use only.