Servers and Cloud Computing blog
August 18, 2025

Distributed Generative AI Inference on Arm
By Waheed Brown

Reading time 5 minutes

Generative AI is becoming more efficient, and large language models (LLMs) are shrinking in size. This creates new opportunities to run LLMs on more efficient hardware. For example, cloud services can now run AI inference on Arm-based CPUs.

What is distributed AI inference?

AI inference happens when an AI application processes a user’s prompt. With LLMs, users commonly enter text into a chatbot's prompt window, press "send", and then this request is processed on an AI server. AI inference does not need to run on a single machine or virtual machine (VM). To scale inference beyond a single machine's memory (RAM or VRAM), LLM weights and computations are often distributed across many machines.

How are LLM Weights and Computations Distributed?

LLM weights and computations are often distributed using a client-server model. One machine is appointed as the main node (client). The remaining machines function as worker nodes (servers). Each worker loads a shard of the model and participates in parallel computation.

Using an AI framework like llama.cpp, LLM weights and computations can be distributed across machines using RPC:

# A llama.cpp Worker node listens for inference requests from the Main node.
rpc-server -p 50052 -H 0.0.0.0 -t 64
# The llama.cpp Main node initiates distributed LLM inference
llama-cli -m model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99

Explanation of rpc-server parameters:

  • -p 50052 The listening TCP port on the Worker.
  • -H 0.0.0.0 The host address the Worker binds to; 0.0.0.0 listens on all network interfaces, so the Main node can connect from any address.
  • -t 64 CPU thread count.

Explanation of llama-cli parameters:

  • -m model.gguf Specifies the quantized LLaMA model file (GGUF format) to load.
  • -p "Tell me a joke" The prompt passed to the model for generation.
  • -n 128 Maximum number of tokens to generate in the output.
  • --rpc "$worker_ips" A comma-separated list of worker addresses (host:port pairs).
  • -ngl 99 The number of the LLM's neural network layers to offload to the GPU.
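
The `--rpc` flag expects `$worker_ips` to hold a comma-separated list of `host:port` pairs, one per worker. A minimal sketch, using hypothetical worker addresses (adjust to your environment):

```shell
# Hypothetical worker addresses; each entry is host:port,
# where the port matches the -p value passed to rpc-server.
worker_ips="10.0.0.2:50052,10.0.0.3:50052"

# The Main node passes the list to llama-cli (echoed here for illustration):
echo llama-cli -m model.gguf -p "Tell me a joke" -n 128 --rpc "$worker_ips" -ngl 99
```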

On CPU-only Arm cloud machines, distributed inference runs entirely on CPUs. This makes a worker node's thread count, -t 64, the key parameter. Set the number of threads to match the number of CPU cores on each worker node. Note that on CPU-only Arm machines the GPU delegation parameter, -ngl 99, is ignored.
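
Rather than hard-coding the thread count, each worker can derive it from its own hardware. A sketch using the standard `nproc` utility (the rpc-server invocation is echoed for illustration):

```shell
# Determine the number of CPU cores on this worker.
threads=$(nproc)

# Launch the llama.cpp worker with one thread per core:
echo rpc-server -p 50052 -H 0.0.0.0 -t "$threads"
```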

What Arm Cloud Machines are Available for Distributed Inference?

All three major cloud providers have Arm machines that are suitable for distributed inference:

Amazon Web Services
  • AWS Graviton CPU VMs: Graviton 2, Graviton 3, and Graviton 4 VMs
  • NVIDIA Grace Arm CPU VMs: P6e (NVIDIA GB200 GPUs)

Google Cloud
  • Google Axion CPU VMs
  • NVIDIA Grace Arm CPU VMs: A4x (NVIDIA GB200 GPUs)

Microsoft Azure
  • Azure Cobalt 100 CPU VMs
  • NVIDIA Grace Arm CPU VMs

What is Next?

To learn more about inference using llama.cpp on Arm, visit our Arm Learning Path.
Re-use is only permitted for informational and non-commercial or personal use only.