Optimize Llama.cpp with Arm I8MM instruction
Boosting Llama.cpp Q6_K and Q4_K inference with Arm's I8MM (smmla) instruction for faster, more efficient int8 matrix multiplication on Neoverse-N2 CPUs.
By Yibo Cai

This blog post describes how we optimized Llama.cpp Q6_K and Q4_K quantized model inference with the Arm I8MM feature, specifically the signed 8-bit integer matrix multiply-accumulate instruction smmla.
Understanding Llama.cpp quantization
Llama.cpp is an open-source C++ library for running large language models, optimized for fast CPU inference. It leverages techniques like quantization (for example, 8-bit or 4-bit integer formats) to reduce memory usage and accelerate computations, enabling efficient model deployment on consumer and server-grade hardware.
Llama.cpp supports several kinds of quantization, each balancing model accuracy against performance: smaller data sizes improve inference speed but can reduce accuracy by increasing perplexity.
For example, Q8_0 uses an 8-bit integer to represent one datapoint, while Q6_K reduces the data size to 6 bits.
Quantization is done in blocks, and datapoints in a block share a single scale factor.
For example, Q8_0 is processed in blocks of 32 datapoints, as follows:
- Retrieve 32 floating-point values from the original data, denoted f[0:32]
- Calculate the maximum of the absolute values, that is, mf = max(abs(f[0:32]))
- Calculate scale_factor = mf / max(int8) = mf / 127
- Quantization: q[i] = round(f[i] / scale_factor)
- Dequantization: v[i] = q[i] * scale_factor
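
The following is a minimal C++ sketch of these steps. The struct and function names are illustrative only and do not match llama.cpp's actual Q8_0 data layout (which, for instance, stores the scale in fp16).

```cpp
#include <cmath>
#include <cstdint>

// Illustrative Q8_0-style block: 32 datapoints sharing one scale factor.
struct BlockQ8 {
    float  scale;   // shared scale factor
    int8_t q[32];   // quantized datapoints
};

BlockQ8 quantize_block_q8(const float f[32]) {
    BlockQ8 b{};
    float mf = 0.0f;
    for (int i = 0; i < 32; i++)
        mf = std::fmax(mf, std::fabs(f[i]));                    // mf = max(abs(f[0:32]))
    b.scale = mf / 127.0f;                                      // scale_factor = mf / max(int8)
    for (int i = 0; i < 32; i++)
        b.q[i] = b.scale != 0.0f
                     ? (int8_t)std::lround(f[i] / b.scale)      // q[i] = round(f[i] / scale_factor)
                     : 0;
    return b;
}

// Dequantization: v[i] = q[i] * scale_factor
float dequantize(const BlockQ8 &b, int i) { return b.q[i] * b.scale; }
```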
Q6_K is more complicated. As shown in the figure below (and in the struct sketch that follows it), datapoints are organized at two levels:
- One super block contains 256 datapoints and carries one super-block scale factor in floating-point format.
- Each super block consists of 16 blocks of 16 datapoints, and the datapoints in a block share one block-level scale factor in integer format.

Figure 1: Llama.cpp Q6_K quantization
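
In code, a Q6_K super block can be represented roughly as follows. This sketch loosely follows ggml's block_q6_K, but the field names and exact layout here should be treated as an approximation rather than the authoritative definition.

```cpp
#include <cstdint>

constexpr int QK_K = 256;           // datapoints per super block

// Rough sketch of a Q6_K super block (each 6-bit quant is split into
// a low 4-bit part and a high 2-bit part).
struct BlockQ6K {
    uint8_t  ql[QK_K / 2];          // lower 4 bits of the 6-bit quants
    uint8_t  qh[QK_K / 4];          // upper 2 bits of the 6-bit quants
    int8_t   scales[QK_K / 16];     // 16 block-level scales, integer format
    uint16_t d;                     // super-block scale factor (fp16 in ggml)
};
// Total: 128 + 64 + 16 + 2 = 210 bytes for 256 datapoints (~6.56 bits each).
```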
Optimize Llama.cpp with Arm I8MM instruction
As with most AI workloads, LLM inference spends most of its CPU cycles in matrix multiplication. Arm I8MM, more specifically the smmla instruction, accelerates 8-bit integer matrix multiplication.
To explain what smmla does and why it is efficient, assume we want to multiply the two matrices in the following figure.

Figure 2: matrix multiplication
Following the textbook approach, we can calculate the four scalars in the output matrix one by one: the first output scalar is the inner product of the first row of matrix x with the first column of matrix y, and so on. Four inner product operations are required.
There is a more efficient way to do it with outer products. As shown in the following figure, we can instead multiply the first column of matrix x with the first row of matrix y to get four partial output scalars in one shot. Summing the two partial outputs gives the result, so only two outer product operations are necessary. A scalar sketch of both approaches is given after the figure.

Figure 3: outer product
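
To make the difference concrete, here is a small scalar comparison of the two loop orderings for a 2x2 case. The input values are illustrative; both orderings compute the same product.

```cpp
#include <cstdio>

int main() {
    int x[2][2] = {{1, 2}, {3, 4}};   // illustrative inputs
    int y[2][2] = {{5, 6}, {7, 8}};
    int inner[2][2] = {}, outer[2][2] = {};

    // Textbook inner-product form: each output is a dot product of
    // a row of x with a column of y (four inner products).
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++)
            for (int k = 0; k < 2; k++)
                inner[i][j] += x[i][k] * y[k][j];

    // Outer-product form: column k of x times row k of y updates all four
    // outputs at once; two such rank-1 updates give the full product.
    for (int k = 0; k < 2; k++)
        for (int i = 0; i < 2; i++)
            for (int j = 0; j < 2; j++)
                outer[i][j] += x[i][k] * y[k][j];

    printf("inner: %d %d %d %d\n", inner[0][0], inner[0][1], inner[1][0], inner[1][1]);
    printf("outer: %d %d %d %d\n", outer[0][0], outer[0][1], outer[1][0], outer[1][1]);
    return 0;
}
```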
smmla implements a vector-level outer product, as the picture below shows. Note that vmmlaq_s32 is the compiler intrinsic that maps to the smmla instruction; a small usage example is given after the figure.
- Each input vector (int8x16) is divided into two int8x8 vectors
- Calculate the inner products of the four int8x8 vector pairs
- Store the results into the four lanes of the output vector (int32x4)

Figure 4: smmla instruction
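
The snippet below exercises vmmlaq_s32 against a scalar reference. The build flag is an assumption for common toolchains (for example, g++ -O2 -march=armv8.2-a+i8mm); check your compiler's documentation for how to enable I8MM.

```cpp
#include <arm_neon.h>
#include <cstdint>
#include <cstdio>

int main() {
    // x and y each hold a 2x8 int8 matrix, row-major: bytes 0..7 are row 0,
    // bytes 8..15 are row 1. smmla treats y as the columns of an 8x2 matrix.
    int8_t x[16], y[16];
    for (int i = 0; i < 16; i++) { x[i] = (int8_t)(i - 8); y[i] = (int8_t)(2 * i - 15); }

    int8x16_t vx  = vld1q_s8(x);
    int8x16_t vy  = vld1q_s8(y);
    int32x4_t acc = vdupq_n_s32(0);

    // acc[2*i + j] += dot(x row i, y row j), for i, j in {0, 1}
    acc = vmmlaq_s32(acc, vx, vy);

    int32_t out[4];
    vst1q_s32(out, acc);

    // Scalar reference: the same four int8x8 inner products.
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            int32_t ref = 0;
            for (int k = 0; k < 8; k++) ref += x[i * 8 + k] * y[j * 8 + k];
            printf("lane %d: smmla=%d ref=%d\n", 2 * i + j, out[2 * i + j], ref);
        }
    return 0;
}
```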
Armed with the smmla instruction, we can accelerate matrix multiplication by progressing two rows and two columns at a time. As explained in the following figure, the computation steps are:
- Load two rows (int8x16) from matrix x into vx0 and vx1, and two columns from y into vy0 and vy1.
- Zip vx0 and vx1 to combine the bottom halves of the two vectors into one vector and the upper halves into another. This is necessary for smmla to work correctly. Do the same for vy0 and vy1.
- Calculate four temporary scalar results with two smmla instructions.
- Progress to the next blocks and accumulate the temporary results until all data is consumed (see the kernel sketch after the figure).

Figure 5: matrix multiplication with smmla
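
Below is a minimal kernel sketch of these steps. It is not the actual Llama.cpp Q6_K/Q4_K code: it computes one 2x2 tile of x * y^T, assuming the two "columns" of y are stored contiguously (that is, y is kept transposed) and that K is a multiple of 16.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Compute a 2x2 tile: c[2*i + j] = dot(xi[0:K], yj[0:K]) for i, j in {0, 1}.
// Requires I8MM (for example -march=armv8.2-a+i8mm) and K % 16 == 0.
void tile2x2_smmla(const int8_t *x0, const int8_t *x1,
                   const int8_t *y0, const int8_t *y1,
                   int K, int32_t c[4]) {
    int32x4_t acc = vdupq_n_s32(0);
    for (int k = 0; k < K; k += 16) {
        // Step 1: load two rows of x and two columns of y (stored contiguously).
        int8x16_t vx0 = vld1q_s8(x0 + k);
        int8x16_t vx1 = vld1q_s8(x1 + k);
        int8x16_t vy0 = vld1q_s8(y0 + k);
        int8x16_t vy1 = vld1q_s8(y1 + k);
        // Step 2: zip at 64-bit granularity so each smmla operand holds the
        // matching 8-byte halves of both rows: [x0_lo | x1_lo] and [x0_hi | x1_hi].
        int8x16_t vx_lo = vreinterpretq_s8_s64(
            vzip1q_s64(vreinterpretq_s64_s8(vx0), vreinterpretq_s64_s8(vx1)));
        int8x16_t vx_hi = vreinterpretq_s8_s64(
            vzip2q_s64(vreinterpretq_s64_s8(vx0), vreinterpretq_s64_s8(vx1)));
        int8x16_t vy_lo = vreinterpretq_s8_s64(
            vzip1q_s64(vreinterpretq_s64_s8(vy0), vreinterpretq_s64_s8(vy1)));
        int8x16_t vy_hi = vreinterpretq_s8_s64(
            vzip2q_s64(vreinterpretq_s64_s8(vy0), vreinterpretq_s64_s8(vy1)));
        // Step 3: two smmla calls accumulate the four partial results
        // (acc lanes are [c00, c01, c10, c11]).
        acc = vmmlaq_s32(acc, vx_lo, vy_lo);
        acc = vmmlaq_s32(acc, vx_hi, vy_hi);
        // Step 4: the loop carries acc to the next 16-element block.
    }
    vst1q_s32(c, acc);
}
```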
We optimized the Llama.cpp Q6_K and Q4_K matrix multiplication kernels with smmla and observed a significant performance uplift. The figure below compares Llama.cpp performance before and after the Q6_K optimization:
- S_TG: speed of token generation; higher is better
- S_PP: speed of prompt prefill; higher is better
The test platform is Arm Neoverse-N2.
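
For reference, prompt-processing and token-generation throughput can be measured with llama.cpp's bundled llama-bench tool; the model path and parameters below are illustrative, not the exact configuration used for these results.

```
# Hypothetical invocation: measure prompt processing (-p) and token
# generation (-n) speed for a Q6_K model on the CPU backend.
./llama-bench -m ./models/model-Q6_K.gguf -p 512 -n 128
```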

Figure 6: Arm I8MM improves Llama.cpp Q6_K model performance
Upstream patches for reference:
By Yibo Cai
