Overview

This guide shows you how to use Neon intrinsics in your C, or C++, code to take advantage of the Advanced SIMD technology in the Armv8 architecture. The simple examples demonstrate how to use these intrinsics and provide an opportunity to explain their purpose.

Low-level software engineers, library writers, and other developers wanting to use Advanced SIMD technology will find this guide useful.

At the end of this guide there is a Check Your Knowledge section to test whether you have understood the following key concepts:

  • To know what Neon is, and understand the different ways of using Neon.
  • To know the basics of using Neon intrinsics in the C language.
  • To know where to find the Neon intrinsics reference, and the Neon instruction set.

What is Neon?

Neon is the implementation of Arm’s Advanced SIMD architecture. 

The purpose of Neon is to accelerate data manipulation by providing:

  • Thirty-two 128-bit vector registers, each capable of containing multiple lanes of data.
  • SIMD instructions to operate simultaneously on those multiple lanes of data.

Applications that can benefit from Neon technology include multimedia and signal processing, 3D graphics, speech, image processing, or other applications where fixed and floating-point performance is critical.

As a programmer, there are a number of ways you can make use of Neon technology:

  • Neon-enabled open source libraries such as the Arm Compute Library provide one of the easiest ways to take advantage of Neon.
  • Auto-vectorization features in your compiler can automatically optimize your code to take advantage of Neon.
  • Neon intrinsics are function calls that the compiler replaces with appropriate Neon instructions. This gives you direct, low-level access to the exact Neon instructions you want, all from C, or C++ code.
  • For very high performance, hand-coded Neon assembler can be the best approach for experienced programmers.

In this guide we will focus on using the Neon intrinsics for AArch64, but they can be compiled for AArch32 also. For more information about AArch32 Neon see Introducing Neon for Armv8-A. First we will look at a simplified image processing example and matrix multiplication. Then we will move on to a more general discussion about the intrinsics themselves.

Why intrinsics?

Intrinsics are functions whose precise implementation is known to a compiler. The Neon intrinsics are a set of C and C++ functions defined in arm_neon.h which are supported by the Arm compilers and GCC. These functions let you use Neon without having to write assembly code directly, since the functions themselves contain short assembly kernels which are inlined into the calling code. Additionally, register allocation and pipeline optimization are handled by the compiler so many difficulties faced by the assembly programmer are avoided.

See the Neon Intrinsics Reference for a list of all the Neon intrinsics. The Neon intrinsics engineering specification is contained in the Arm C Language Extensions (ACLE).

Using the Neon intrinsics has a number of benefits:

  • Powerful: Intrinsics give the programmer direct access to the Neon instruction set without the need for hand-written assembly code.
  • Portable: Hand-written Neon assembly instructions might need to be re-written for different target processors. C and C++ code containing Neon intrinsics can be compiled for a new target or a new execution state (for example, migrating from AArch32 to AArch64) with minimal or no code changes.
  • Flexible: The programmer can exploit Neon when needed, or use C/C++ when it isn’t, while avoiding many low-level engineering concerns.

However, intrinsics might not be the right choice in all situations:

  • There is a steeper learning curve to use Neon intrinsics than importing a library or relying on a compiler.
  • Hand-optimized assembly code might offer the greatest scope for performance improvement even if it is more difficult to write.

We shall now go through a couple of examples where we re-implement some C functions using Neon intrinsics. The examples chosen do not reflect the full complexity of their application, but they should illustrate the use of intrinsics and act as a starting point for more complex code.

Example: RGB deinterleaving

Consider a 24-bit RGB image where the image is an array of pixels, each with a red, blue, and green element. In memory this could appear as:

This is because the RGB data is interleaved, accessing and manipulating the three separate color channels presents a problem to the programmer. In simple circumstances we could write our own single color channel operations by applying the “modulo 3” to the interleaved RGB values. However, for more complex operations, such as Fourier transforms, it would make more sense to extract and split the channels.

We have an array of RGB values in memory and we want to deinterleave them and place the values in separate color arrays. A C procedure to do this might look like this:

void rgb_deinterleave_c(uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *rgb, int len_color) {
    /*
     * Take the elements of "rgb" and store the individual colors "r", "g", and "b".
     */
    for (int i=0; i < len_color; i++) {
        r[i] = rgb[3*i];
        g[i] = rgb[3*i+1];
        b[i] = rgb[3*i+2];
    }
}

But there is an issue. Compiling with Arm Compiler 6 at optimization level -O3 (very high optimization) and examining the disassembly shows no Neon instructions or registers are being used. Each individual 8-bit value is stored in a separate 64-bit general registers. Considering the full width Neon registers are 128 bits wide, which could each hold 16 of our 8-bit values in the example, re-writing the solution to use Neon intrinsics should give us good results.

void rgb_deinterleave_neon(uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *rgb, int len_color) {
    /*
     * Take the elements of "rgb" and store the individual colors "r", "g", and "b"
     */
    int num8x16 = len_color / 16;
    uint8x16x3_t intlv_rgb;
    for (int i=0; i < num8x16; i++) {
        intlv_rgb = vld3q_u8(rgb+3*16*i);
        vst1q_u8(r+16*i, intlv_rgb.val[0]);
        vst1q_u8(g+16*i, intlv_rgb.val[1]);
        vst1q_u8(b+16*i, intlv_rgb.val[2]);
    }
}

In this example we have used the following types and intrinsics:

Code element What is it? Why are we using it?
uint8x16_t An array of 16 8-bit unsigned integers. One uint8x16_t fits into a 128-bit register. We can ensure there are no wasted register bits even in C code.
uint8x16x3_t A struct with three uint8x16_t elements. A temporary holding area for the current color values in the loop.
vld3q_u8(…) A function which returns a uint8x16x3_t by loading a contiguous region of 3*16 bytes of memory. Each byte loaded is placed one of the three uint8x16_t arrays in an alternating pattern. At the lowest level, this intrinsic guarantees the generation of an LD3 instruction, which loads the values from a given address into three Neon registers in an alternating pattern.
vst1q_u8(…) A function which stores a uint8x16_t at a given address. It stores a full 128-bit register full of byte values.

Matrix multiplication example

Matrix multiplication is an operation performed in many data intensive applications. It is made up of groups of arithmetic operations which are repeated in a straightforward way:

Neon Optimizing with C Code Matrix Diagram 

The matrix multiplication process is as follows:

  • A- Take a row in the first matrix
  • B- Perform a dot product of this row with a column from the second matrix
  • C- Store the result in the corresponding row and column of a new matrix

For matrices of 32-bit floats, the multiplication could be written as:

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int j_idx=0; j_idx < m; j_idx++) {
            C[n*j_idx + i_idx] = 0;
            for (int k_idx=0; k_idx < k; k_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}

We have assumed a column-major layout of the matrices in memory. That is, an n x m matrix M is represented as an array M_array where Mij = M_array[n*j + i].

This code is suboptimal, since it does not make full use of Neon. We can begin to improve it by using intrinsics, but let’s tackle a simpler problem first by looking at small, fixed-size matrices before moving on to larger matrices.

The following code uses intrinsics to multiply two 4x4 matrices. Since we have a small and fixed number of values to process, all of which can fit into the processor’s Neon registers at once, we can completely unroll the loops.

void matrix_multiply_4x4_neon(float32_t *A, float32_t *B, float32_t *C) {
	// these are the columns A
	float32x4_t A0;
	float32x4_t A1;
	float32x4_t A2;
	float32x4_t A3;
	
	// these are the columns B
	float32x4_t B0;
	float32x4_t B1;
	float32x4_t B2;
	float32x4_t B3;
	
	// these are the columns C
	float32x4_t C0;
	float32x4_t C1;
	float32x4_t C2;
	float32x4_t C3;
	
	A0 = vld1q_f32(A);
	A1 = vld1q_f32(A+4);
	A2 = vld1q_f32(A+8);
	A3 = vld1q_f32(A+12);
	
	// Zero accumulators for C values
	C0 = vmovq_n_f32(0);
	C1 = vmovq_n_f32(0);
	C2 = vmovq_n_f32(0);
	C3 = vmovq_n_f32(0);
	
	// Multiply accumulate in 4x1 blocks, i.e. each column in C
	B0 = vld1q_f32(B);
	C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
	C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
	C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
	C0 = vfmaq_laneq_f32(C0, A3, B0, 3);
	vst1q_f32(C, C0);
	
	B1 = vld1q_f32(B+4);
	C1 = vfmaq_laneq_f32(C1, A0, B1, 0);
	C1 = vfmaq_laneq_f32(C1, A1, B1, 1);
	C1 = vfmaq_laneq_f32(C1, A2, B1, 2);
	C1 = vfmaq_laneq_f32(C1, A3, B1, 3);
	vst1q_f32(C+4, C1);
	
	B2 = vld1q_f32(B+8);
	C2 = vfmaq_laneq_f32(C2, A0, B2, 0);
	C2 = vfmaq_laneq_f32(C2, A1, B2, 1);
	C2 = vfmaq_laneq_f32(C2, A2, B2, 2);
	C2 = vfmaq_laneq_f32(C2, A3, B2, 3);
	vst1q_f32(C+8, C2);
	
	B3 = vld1q_f32(B+12);
	C3 = vfmaq_laneq_f32(C3, A0, B3, 0);
	C3 = vfmaq_laneq_f32(C3, A1, B3, 1);
	C3 = vfmaq_laneq_f32(C3, A2, B3, 2);
	C3 = vfmaq_laneq_f32(C3, A3, B3, 3);
	vst1q_f32(C+12, C3);
}

We have chosen to multiply fixed size 4x4 matrices for a few reasons:

  • Some applications need 4x4 matrices specifically, for example graphics or relativistic physics.
  • The Neon vector registers hold four 32-bit values, so matching the program to the architecture will make it easier to optimize.
  • We can take this 4x4 kernel and use it in a more general one.

Let's summarize the intrinsics that have been used here:

Code element What is it? Why are we using it?
float32x4_t An array of four 32-bit floats. One uint32x4_t fits into a 128-bit register. We can ensure there are no wasted register bits even in C code.
vld1q_f32(…) A function which loads four 32-bit floats into a float32x4_t. To get the matrix values we need from A and B.
vfmaq_lane_f32(…) A function which uses the fused multiply accumulate instruction. Multiplies a float32x4_t value by a single element of another float32x4_t then adds the result to a third float32x4_t before returning the result. Since the matrix row-on-column dot products are a set of multiplications and additions, this operation fits quite naturally.
vst1q_f32(…) A function which stores a float32x4_t at a given address. To store the results after they are calculated.

Now that we can multiply a 4x4 matrix, we can multiply larger matrices by treating them as blocks of 4x4 matrices. A flaw with this approach is that it only works with matrix sizes which are a multiple of four in both dimensions, but by padding any matrix with zeroes you can use this method without changing it.

The code for a more general matrix multiplication is listed below. The structure of the kernel has changed very little, with the addition of loops and address calculations being the major changes. As in the 4x4 kernel we have used unique variable names for the columns of B, even though we could have used one variable and re-loaded. This acts as a hint to the compiler to assign different registers to these variables, which will enable the processor to complete the arithmetic instructions for one column while waiting on the loads for another.

void matrix_multiply_neon(float32_t  *A, float32_t  *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
	/* 
	 * Multiply matrices A and B, store the result in C. 
	 * It is the user's responsibility to make sure the matrices are compatible.
	 */	

	int A_idx;
	int B_idx;
	int C_idx;
	
	// these are the columns of a 4x4 sub matrix of A
	float32x4_t A0;
	float32x4_t A1;
	float32x4_t A2;
	float32x4_t A3;
	
	// these are the columns of a 4x4 sub matrix of B
	float32x4_t B0;
	float32x4_t B1;
	float32x4_t B2;
	float32x4_t B3;
	
	// these are the columns of a 4x4 sub matrix of C
	float32x4_t C0;
	float32x4_t C1;
	float32x4_t C2;
	float32x4_t C3;
	
	for (int i_idx=0; i_idx<n; i_idx+=4 {
for (int j_idx=0; j_idx<m; j_idx+=4){
// zero accumulators before matrix op
c0=vmovq_n_f32(0);
c1=vmovq_n_f32(0);
c2=vmovq_n_f32(0);
c3=vmovq_n_f32(0);
for (int k_idx=0; k_idx<k; k_idx+=4){
// compute base index to 4x4 block
a_idx = i_idx + n*k_idx;
b_idx = k*j_idx k_idx;

// load most current a values in row
A0=vld1q_f32(A+A_idx);
A1=vld1q_f32(A+A_idx+n);
A2=vld1q_f32(A+A_idx+2*n);
A3=vld1q_f32(A+A_idx+3*n);

// multiply accumulate 4x1 blocks, i.e. each column C
B0=vld1q_f32(B+B_idx);
C0=vfmaq_laneq_f32(C0,A0,B0,0);
C0=vfmaq_laneq_f32(C0,A1,B0,1);
C0=vfmaq_laneq_f32(C0,A2,B0,2);
C0=vfmaq_laneq_f32(C0,A3,B0,3);

B1=v1d1q_f32(B+B_idx+k);
C1=vfmaq_laneq_f32(C1,A0,B1,0);
C1=vfmaq_laneq_f32(C1,A1,B1,1);
C1=vfmaq_laneq_f32(C1,A2,B1,2);
C1=vfmaq_laneq_f32(C1,A3,B1,3);

B2=vld1q_f32(B+B_idx+2*k);
C2=vfmaq_laneq_f32(C2,A0,B2,0);
C2=vfmaq_laneq_f32(C2,A1,B2,1);
C2=vfmaq_laneq_f32(C2,A2,B2,2);
C2=vfmaq_laneq_f32(C2,A3,B3,3);

B3=vld1q_f32(B+B_idx+3*k);
C3=vfmaq_laneq_f32(C3,A0,B3,0);
C3=vfmaq_laneq_f32(C3,A1,B3,1);
C3=vfmaq_laneq_f32(C3,A2,B3,2);
C3=vfmaq_laneq_f32(C3,A3,B3,3);
}
//Compute base index for stores
C_idx = n*j_idx + i_idx;
vstlq_f32(C+C_idx, C0);
vstlq_f32(C+C_idx+n,Cl);
vstlq_f32(C+C_idx+2*n,C2);
vstlq_f32(C+C_idx+3*n,C3);
}
}
}

Compiling and disassembling this function, and comparing it with our C function shows:

  • Fewer arithmetic instructions for a given matrix multiplication, since we are leveraging the Advanced SIMD technology with full register packing. Pure C code generally does not do this.
  • FMLA instead of FMUL instructions. As specified by the intrinsics.
  • Fewer loop iterations. When used properly intrinsics allow loops to be unrolled easily.
  • However, there are unnecessary loads and stores due to memory allocation and initialization of data types (for example, float32x4_t) which are not used in the pure C code.

The full source code above can be compiled and disassembled on an Arm machine using the following commands:

gcc -g -o3 matrix.c -o exe_matrix_o3
objdump -d exe_ matrix _o3 > disasm_matrix_o3

If you don't have access to Arm-based hardware, you can use Arm DS-5 Community Edition and the Armv8-A Foundation Platform.

Program conventions

Macros

In order to use the intrinsics the Advanced SIMD architecture must be supported, and some specific instructions may or may not be enabled in any case. When the following macros are defined and equal to 1, the corresponding features are available:

  • __ARM_NEON
    • Advanced SIMD is supported by the compiler
    • Always 1 for AArch64
  • __ARM_NEON_FP
    • Neon floating-point operations are supported
    • Always 1 for AArch64
  • __ARM_FEATURE_CRYPTO
    • Crypto instructions are available.
    • Cryptographic Neon intrinsics are therefore available.
  • __ARM_FEATURE_FMA
    • The fused multiply-accumulate instructions are available.
    • Neon intrinsics which use these are therefore available.

This list is not exhaustive and further macros are detailed in the Arm C Language Extensions document.

Types

There are three major categories of data type available in arm_neon.h which follow these patterns:

baseW_t
Scalar data types
baseWxL_t
Vector data types
baseWxLxN_t
Vector array data types

Where:

  • base refers to the fundamental data type.
  • W is the width of the fundamental type.
  • L is the number of scalar data type instances in a vector data type, for example an array of scalars.
  • N is the number of vector data type instances in a vector array type, for example a struct of arrays of scalars.

Generally W and L are such that the vector data types are 64 or 128 bits long, and so fit completely into a Neon register. N corresponds with those instructions which operate on multiple registers at once.

In our earlier code we encountered an example of all three:

  • uint8_t
  • uint8x16_t
  • uint8x16x3_t

Functions

As per the Arm C Language Extensions, the function prototypes from arm_neon.h follow a common pattern. At the most general level this is:

ret v[p][q][r]name[u][n][q][x][_high][_lane | laneq][_n][_result]_type(args)

Be wary that some of the letters and names are overloaded, but in the order above:

ret
the return type of the function.
v
short for vector and is present on all the intrinsics.
p
indicates a pairwise operation. ([value] means value may be present).
q
indicates a saturating operation (with the exception of vqtb[l][x] in AArch64 operations where the q indicates 128-bit index and result operands).
r
indicates a rounding operation.
name
the descriptive name of the basic operation. Often this is an Advanced SIMD instruction, but it does not have to be.
u
indicates signed-to-unsigned saturation.
n
indicates a narrowing operation.
q
postfixing the name indicates an operation on 128-bit vectors.
x
indicates an Advanced SIMD scalar operation in AArch64. It can be one of b, h, s or d (that is, 8, 16, 32, or 64 bits).
_high
In AArch64, used for widening and narrowing operations involving 128-bit operands. For widening 128-bit operands, high refers to the top 64-bits of the source operand(s). For narrowing, it refers to the top 64-bits of the destination operand.
_n
indicates a scalar operand supplied as an argument.
_lane
indicates a scalar operand taken from the lane of a vector. _laneq indicates a scalar operand taken from the lane of an input vector of 128-bit width. ( left | right means only left or right would appear).
type
the primary operand type in short form.
args
the function's arguments.

Check your knowledge

 

Related information

The Neon Intrinsics Reference provides a searchable reference of the functions specified by the ACLE.

The Architecture Exploration Tools let you investigate the Advanced SIMD instruction set.

The Arm Architecture Reference Manual provides a complete specification of the Advanced SIMD instruction set.

Useful links to training: