Overview
This guide shows you how to use Neon intrinsics in your C, or C++, code to take advantage of the Advanced SIMD technology in the Armv8 architecture. The simple examples demonstrate how to use these intrinsics and provide an opportunity to explain their purpose.
Low-level software engineers, library writers, and other developers wanting to use Advanced SIMD technology will find this guide useful.
At the end of this guide there is a Check Your Knowledge section to test whether you have understood the following key concepts:
 To know what Neon is, and understand the different ways of using Neon.
 To know the basics of using Neon intrinsics in the C language.
 To know where to find the Neon intrinsics reference, and the Neon instruction set.
What is Neon?
Neon is the implementation of Arm’s Advanced SIMD architecture.
The purpose of Neon is to accelerate data manipulation by providing:
 Thirty-two 128-bit vector registers, each capable of containing multiple lanes of data.
 SIMD instructions to operate simultaneously on those multiple lanes of data.
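As a rough illustration of what operating on multiple lanes means, the sketch below scales four 32-bit floats with one vector multiply. The function name scale_4_floats is made up for this example, and the scalar branch is only an assumption-free fallback so that the sketch still builds with a compiler that does not target Neon.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: x[i] = x[i] * s for the four floats at x. */
void scale_4_floats(float *x, float s)
{
    assert(x != NULL);
#if defined(__ARM_NEON)
    float32x4_t v = vld1q_f32(x);   /* four 32-bit lanes fill one 128-bit register */
    v = vmulq_n_f32(v, s);          /* one vector multiply covers all four lanes */
    vst1q_f32(x, v);
#else
    for (int i = 0; i < 4; i++) {   /* scalar equivalent, one lane at a time */
        x[i] *= s;
    }
#endif
}
```

Calling scale_4_floats on {1, 2, 3, 4} with s = 2.5 leaves {2.5, 5, 7.5, 10} in the array on either branch.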
Applications that can benefit from Neon technology include multimedia and signal processing, 3D graphics, speech, image processing, or other applications where fixed-point and floating-point performance is critical.
As a programmer, there are a number of ways you can make use of Neon technology:
 Neon-enabled open source libraries such as the Arm Compute Library provide one of the easiest ways to take advantage of Neon.
 Auto-vectorization features in your compiler can automatically optimize your code to take advantage of Neon.
 Neon intrinsics are function calls that the compiler replaces with appropriate Neon instructions. This gives you direct, low-level access to the exact Neon instructions you want, all from C or C++ code.
 For very high performance, hand-coded Neon assembler can be the best approach for experienced programmers.
In this guide we will focus on using the Neon intrinsics for AArch64, but they can also be compiled for AArch32. For more information about AArch32 Neon see Introducing Neon for Armv8-A. First we will look at a simplified image processing example and matrix multiplication. Then we will move on to a more general discussion about the intrinsics themselves.
Why intrinsics?
Intrinsics are functions whose precise implementation is known to a compiler. The Neon intrinsics are a set of C and C++ functions defined in arm_neon.h which are supported by the Arm compilers and GCC. These functions let you use Neon without having to write assembly code directly, since the functions themselves contain short assembly kernels which are inlined into the calling code. Additionally, register allocation and pipeline optimization are handled by the compiler, so many difficulties faced by the assembly programmer are avoided.
See the Neon Intrinsics Reference for a list of all the Neon intrinsics. The Neon intrinsics engineering specification is contained in the Arm C Language Extensions (ACLE).
Using the Neon intrinsics has a number of benefits:
 Powerful: Intrinsics give the programmer direct access to the Neon instruction set without the need for handwritten assembly code.
 Portable: Handwritten Neon assembly instructions might need to be rewritten for different target processors. C and C++ code containing Neon intrinsics can be compiled for a new target or a new execution state (for example, migrating from AArch32 to AArch64) with minimal or no code changes.
 Flexible: The programmer can exploit Neon when needed, or use C/C++ when it isn’t, while avoiding many low-level engineering concerns.
However, intrinsics might not be the right choice in all situations:
 Using Neon intrinsics has a steeper learning curve than importing a library or relying on the compiler's auto-vectorization.
 Hand-optimized assembly code might offer the greatest scope for performance improvement even if it is more difficult to write.
We shall now go through a couple of examples where we reimplement some C functions using Neon intrinsics. The examples chosen do not reflect the full complexity of their application, but they should illustrate the use of intrinsics and act as a starting point for more complex code.
Example: RGB deinterleaving
Consider a 24-bit RGB image where the image is an array of pixels, each with a red, green, and blue element. In memory the three channels are interleaved, so the bytes appear in the order R0, G0, B0, R1, G1, B1, and so on.
Because the RGB data is interleaved, accessing and manipulating the three separate color channels presents a problem to the programmer. In simple circumstances we could write our own single color channel operations by selecting elements of the interleaved RGB values according to their index modulo 3. However, for more complex operations, such as Fourier transforms, it would make more sense to extract and split the channels.
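As a minimal sketch of the "modulo 3" approach, the function below halves a single color channel in place without separating the channels first. The name halve_channel and its parameters are hypothetical, and the code assumes the channel order R, G, B within each pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: halve one channel (0 = R, 1 = G, 2 = B) in place. */
void halve_channel(uint8_t *rgb, size_t len_rgb, unsigned channel)
{
    assert(rgb != NULL && channel < 3);
    for (size_t i = 0; i < len_rgb; i++) {
        if (i % 3 == channel) {   /* the index modulo 3 selects the channel */
            rgb[i] /= 2;
        }
    }
}
```

This works for a simple per-element operation, but every iteration touches a single byte and tests its index, which is the overhead that separating the channels avoids.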
We have an array of RGB values in memory and we want to deinterleave them and place the values in separate color arrays. A C procedure to do this might look like this:
#include <stdint.h>

void rgb_deinterleave_c(uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *rgb, int len_color) {
    /*
     * Take the elements of "rgb" and store the individual colors "r", "g", and "b".
     */
    for (int i=0; i < len_color; i++) {
        r[i] = rgb[3*i];
        g[i] = rgb[3*i+1];
        b[i] = rgb[3*i+2];
    }
}
But there is an issue. Compiling with Arm Compiler 6 at optimization level -O3 (very high optimization) and examining the disassembly shows that no Neon instructions or registers are being used. Each individual 8-bit value is stored in a separate 64-bit general-purpose register. Since the full-width Neon registers are 128 bits wide and could each hold 16 of the 8-bit values in this example, rewriting the solution to use Neon intrinsics should give us good results.
#include <arm_neon.h>

void rgb_deinterleave_neon(uint8_t *r, uint8_t *g, uint8_t *b, uint8_t *rgb, int len_color) {
    /*
     * Take the elements of "rgb" and store the individual colors "r", "g", and "b".
     * Only whole groups of 16 pixels are processed.
     */
    int num8x16 = len_color / 16;
    uint8x16x3_t intlv_rgb;
    for (int i=0; i < num8x16; i++) {
        intlv_rgb = vld3q_u8(rgb+3*16*i);
        vst1q_u8(r+16*i, intlv_rgb.val[0]);
        vst1q_u8(g+16*i, intlv_rgb.val[1]);
        vst1q_u8(b+16*i, intlv_rgb.val[2]);
    }
}
In this example we have used the following types and intrinsics:
Code element | What is it? | Why are we using it?
uint8x16_t | An array of 16 8-bit unsigned integers. | One uint8x16_t fits into a 128-bit register. We can ensure there are no wasted register bits even in C code.
uint8x16x3_t | A struct with three uint8x16_t elements. | A temporary holding area for the current color values in the loop.
vld3q_u8(…) | A function which returns a uint8x16x3_t by loading a contiguous region of 3*16 bytes of memory. Each byte loaded is placed in one of the three uint8x16_t arrays in an alternating pattern. | At the lowest level, this intrinsic guarantees the generation of an LD3 instruction, which loads the values from a given address into three Neon registers in an alternating pattern.
vst1q_u8(…) | A function which stores a uint8x16_t at a given address. | It stores a full 128-bit register full of byte values.
The full source code above can be compiled and disassembled on an Arm machine using the following commands:
gcc -g -O3 rgb.c -o exe_rgb_o3
objdump -d exe_rgb_o3 > disasm_rgb_o3
If you don't have access to Arm-based hardware, you can use Arm DS-5 Community Edition and the Armv8-A Foundation Platform.
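The Neon routine in this example only processes whole groups of 16 pixels, so any remaining pixels are left untouched when len_color is not a multiple of 16. One possible way to cover the remainder is a scalar loop after the vector loop, sketched below. The name rgb_deinterleave_any_len is hypothetical, and the preprocessor guard is an addition so that the sketch also builds without Neon, in which case the scalar loop handles every pixel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical variant of rgb_deinterleave_neon that also handles the tail. */
void rgb_deinterleave_any_len(uint8_t *r, uint8_t *g, uint8_t *b,
                              const uint8_t *rgb, int len_color)
{
    assert(r != NULL && g != NULL && b != NULL && rgb != NULL);
    assert(len_color >= 0);
    int i = 0;
#if defined(__ARM_NEON)
    /* Vector part: 16 pixels (48 bytes) per iteration. */
    for (; i + 16 <= len_color; i += 16) {
        uint8x16x3_t v = vld3q_u8(rgb + 3 * i);
        vst1q_u8(r + i, v.val[0]);
        vst1q_u8(g + i, v.val[1]);
        vst1q_u8(b + i, v.val[2]);
    }
#endif
    /* Scalar part: the pixels the vector loop did not reach. */
    for (; i < len_color; i++) {
        r[i] = rgb[3 * i];
        g[i] = rgb[3 * i + 1];
        b[i] = rgb[3 * i + 2];
    }
}
```

With 21 pixels, for example, the vector loop handles the first 16 and the scalar loop handles the last 5.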
Matrix multiplication example
Matrix multiplication is an operation performed in many data-intensive applications. It is made up of groups of arithmetic operations which are repeated in a straightforward way.
The matrix multiplication process is as follows:
 A. Take a row in the first matrix.
 B. Perform a dot product of this row with a column from the second matrix.
 C. Store the result in the corresponding row and column of a new matrix.
For matrices of 32-bit floats, the multiplication could be written as:
#include <arm_neon.h>   /* defines float32_t */

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int j_idx=0; j_idx < m; j_idx++) {
            C[n*j_idx + i_idx] = 0;
            for (int k_idx=0; k_idx < k; k_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}
We have assumed a column-major layout of the matrices in memory. That is, an n x m matrix M is represented as an array M_array where M_{ij} = M_array[n*j + i].
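To make the column-major convention concrete, take the 2 x 3 matrix with rows (1, 2, 3) and (4, 5, 6). Stored column by column it becomes the array {1, 4, 2, 5, 3, 6}, so with zero-based indices the element in row 1, column 2 sits at offset 2*2 + 1 = 5. The small helper below, whose name col_major_index is hypothetical, wraps that calculation.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: offset of element (i, j), zero-based, in a
 * column-major matrix that has n rows.
 */
size_t col_major_index(size_t n, size_t i, size_t j)
{
    assert(i < n);        /* the row index must be inside the column */
    return n * j + i;     /* skip j whole columns, then i rows */
}
```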
This code is suboptimal, since it does not make full use of Neon. We can begin to improve it by using intrinsics, but let’s tackle a simpler problem first by looking at small, fixed-size matrices before moving on to larger matrices.
The following code uses intrinsics to multiply two 4x4 matrices. Since we have a small and fixed number of values to process, all of which can fit into the processor’s Neon registers at once, we can completely unroll the loops.
void matrix_multiply_4x4_neon(float32_t *A, float32_t *B, float32_t *C) {
    // these are the columns of A
    float32x4_t A0;
    float32x4_t A1;
    float32x4_t A2;
    float32x4_t A3;

    // these are the columns of B
    float32x4_t B0;
    float32x4_t B1;
    float32x4_t B2;
    float32x4_t B3;

    // these are the columns of C
    float32x4_t C0;
    float32x4_t C1;
    float32x4_t C2;
    float32x4_t C3;

    A0 = vld1q_f32(A);
    A1 = vld1q_f32(A+4);
    A2 = vld1q_f32(A+8);
    A3 = vld1q_f32(A+12);

    // Zero accumulators for C values
    C0 = vmovq_n_f32(0);
    C1 = vmovq_n_f32(0);
    C2 = vmovq_n_f32(0);
    C3 = vmovq_n_f32(0);

    // Multiply accumulate in 4x1 blocks, i.e. each column in C
    B0 = vld1q_f32(B);
    C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
    C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
    C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
    C0 = vfmaq_laneq_f32(C0, A3, B0, 3);
    vst1q_f32(C, C0);

    B1 = vld1q_f32(B+4);
    C1 = vfmaq_laneq_f32(C1, A0, B1, 0);
    C1 = vfmaq_laneq_f32(C1, A1, B1, 1);
    C1 = vfmaq_laneq_f32(C1, A2, B1, 2);
    C1 = vfmaq_laneq_f32(C1, A3, B1, 3);
    vst1q_f32(C+4, C1);

    B2 = vld1q_f32(B+8);
    C2 = vfmaq_laneq_f32(C2, A0, B2, 0);
    C2 = vfmaq_laneq_f32(C2, A1, B2, 1);
    C2 = vfmaq_laneq_f32(C2, A2, B2, 2);
    C2 = vfmaq_laneq_f32(C2, A3, B2, 3);
    vst1q_f32(C+8, C2);

    B3 = vld1q_f32(B+12);
    C3 = vfmaq_laneq_f32(C3, A0, B3, 0);
    C3 = vfmaq_laneq_f32(C3, A1, B3, 1);
    C3 = vfmaq_laneq_f32(C3, A2, B3, 2);
    C3 = vfmaq_laneq_f32(C3, A3, B3, 3);
    vst1q_f32(C+12, C3);
}
We have chosen to multiply fixed-size 4x4 matrices for a few reasons:
 Some applications need 4x4 matrices specifically, for example graphics or relativistic physics.
 The Neon vector registers hold four 32-bit values, so matching the program to the architecture will make it easier to optimize.
 We can take this 4x4 kernel and use it in a more general one.
Let's summarize the intrinsics that have been used here:
Code element | What is it? | Why are we using it?
float32x4_t | An array of four 32-bit floats. | One float32x4_t fits into a 128-bit register. We can ensure there are no wasted register bits even in C code.
vld1q_f32(…) | A function which loads four 32-bit floats into a float32x4_t. | To get the matrix values we need from A and B.
vfmaq_laneq_f32(…) | A function which uses the fused multiply accumulate instruction. Multiplies a float32x4_t value by a single element of another float32x4_t then adds the result to a third float32x4_t before returning the result. | Since the matrix row-on-column dot products are a set of multiplications and additions, this operation fits quite naturally.
vst1q_f32(…) | A function which stores a float32x4_t at a given address. | To store the results after they are calculated.
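To make the multiply-accumulate by lane concrete, the sketch below applies one such step to four floats: each element of a is multiplied by the single element b[2] and added to acc. The function name fma_by_lane2 and the choice of lane 2 are arbitrary and only for illustration. The scalar branch is an added fallback for builds that do not target AArch64 Neon; it rounds the product before the addition, so for values that are not exactly representable it may differ from the fused result in the last bit.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: acc[i] += a[i] * b[2] for i = 0..3. */
void fma_by_lane2(float *acc, const float *a, const float *b)
{
    assert(acc != NULL && a != NULL && b != NULL);
#if defined(__ARM_NEON) && defined(__aarch64__)
    float32x4_t vacc = vld1q_f32(acc);
    float32x4_t va   = vld1q_f32(a);
    float32x4_t vb   = vld1q_f32(b);
    /* The lane index must be a compile-time constant. */
    vacc = vfmaq_laneq_f32(vacc, va, vb, 2);
    vst1q_f32(acc, vacc);
#else
    for (int i = 0; i < 4; i++) {
        acc[i] += a[i] * b[2];
    }
#endif
}
```

In the 4x4 kernel, four of these steps with lanes 0 to 3 build up one column of C from the four columns of A.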
Now that we can multiply a 4x4 matrix, we can multiply larger matrices by treating them as blocks of 4x4 matrices. A flaw with this approach is that it only works with matrix sizes which are a multiple of four in both dimensions, but by padding any matrix with zeroes you can use this method without changing it.
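One way the zero padding could be done is sketched below for a column-major matrix. Both function names, round_up_4 and pad_matrix_4, are hypothetical, and the caller is assumed to provide a destination buffer that is large enough for the padded matrix.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: round a dimension up to the next multiple of four. */
size_t round_up_4(size_t x)
{
    return (x + 3) / 4 * 4;
}

/*
 * Hypothetical helper: copy the n x m column-major matrix "src" into "dst",
 * which must hold round_up_4(n) * round_up_4(m) floats. The extra rows and
 * columns are filled with zeroes.
 */
void pad_matrix_4(const float *src, size_t n, size_t m, float *dst)
{
    assert(src != NULL && dst != NULL);
    size_t n_pad = round_up_4(n);
    size_t m_pad = round_up_4(m);

    for (size_t idx = 0; idx < n_pad * m_pad; idx++) {
        dst[idx] = 0.0f;
    }
    /* Each source column of n floats starts a padded column of n_pad floats. */
    for (size_t j = 0; j < m; j++) {
        memcpy(dst + n_pad * j, src + n * j, n * sizeof(float));
    }
}
```

Because the padding is zero, the extra rows and columns contribute nothing to the dot products, and the top-left n x m block of the padded result holds the product of the original matrices.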
The code for a more general matrix multiplication is listed below. The structure of the kernel has changed very little, with the addition of loops and address calculations being the major changes.
void matrix_multiply_neon(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    /*
     * Multiply matrices A and B, store the result in C.
     * It is the user's responsibility to make sure the matrices are compatible.
     */

    int A_idx;
    int B_idx;
    int C_idx;

    // these are the columns of a 4x4 sub matrix of A
    float32x4_t A0;
    float32x4_t A1;
    float32x4_t A2;
    float32x4_t A3;

    // these are the columns of a 4x4 sub matrix of B
    float32x4_t B0;
    float32x4_t B1;
    float32x4_t B2;
    float32x4_t B3;

    // these are the columns of a 4x4 sub matrix of C
    float32x4_t C0;
    float32x4_t C1;
    float32x4_t C2;
    float32x4_t C3;

    for (int i_idx=0; i_idx<n; i_idx+=4) {
        for (int j_idx=0; j_idx<m; j_idx+=4) {
            // zero accumulators before matrix op
            C0 = vmovq_n_f32(0);
            C1 = vmovq_n_f32(0);
            C2 = vmovq_n_f32(0);
            C3 = vmovq_n_f32(0);
            for (int k_idx=0; k_idx<k; k_idx+=4) {
                // compute base index to 4x4 block
                A_idx = i_idx + n*k_idx;
                B_idx = k*j_idx + k_idx;

                // load most current A values in row
                A0 = vld1q_f32(A+A_idx);
                A1 = vld1q_f32(A+A_idx+n);
                A2 = vld1q_f32(A+A_idx+2*n);
                A3 = vld1q_f32(A+A_idx+3*n);

                // multiply accumulate 4x1 blocks, i.e. each column of C
                B0 = vld1q_f32(B+B_idx);
                C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
                C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
                C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
                C0 = vfmaq_laneq_f32(C0, A3, B0, 3);

                B1 = vld1q_f32(B+B_idx+k);
                C1 = vfmaq_laneq_f32(C1, A0, B1, 0);
                C1 = vfmaq_laneq_f32(C1, A1, B1, 1);
                C1 = vfmaq_laneq_f32(C1, A2, B1, 2);
                C1 = vfmaq_laneq_f32(C1, A3, B1, 3);

                B2 = vld1q_f32(B+B_idx+2*k);
                C2 = vfmaq_laneq_f32(C2, A0, B2, 0);
                C2 = vfmaq_laneq_f32(C2, A1, B2, 1);
                C2 = vfmaq_laneq_f32(C2, A2, B2, 2);
                C2 = vfmaq_laneq_f32(C2, A3, B2, 3);

                B3 = vld1q_f32(B+B_idx+3*k);
                C3 = vfmaq_laneq_f32(C3, A0, B3, 0);
                C3 = vfmaq_laneq_f32(C3, A1, B3, 1);
                C3 = vfmaq_laneq_f32(C3, A2, B3, 2);
                C3 = vfmaq_laneq_f32(C3, A3, B3, 3);
            }
            // compute base index for stores
            C_idx = n*j_idx + i_idx;
            vst1q_f32(C+C_idx, C0);
            vst1q_f32(C+C_idx+n, C1);
            vst1q_f32(C+C_idx+2*n, C2);
            vst1q_f32(C+C_idx+3*n, C3);
        }
    }
}
Compiling and disassembling this function, and comparing it with our C function shows:
 Fewer arithmetic instructions for a given matrix multiplication, since we are leveraging the Advanced SIMD technology with full register packing. Pure C code generally does not do this.
 FMLA instead of FMUL instructions, as specified by the intrinsics.
 Fewer loop iterations. When used properly, intrinsics allow loops to be unrolled easily.
 However, there are unnecessary loads and stores due to memory allocation and initialization of data types (for example, float32x4_t) which are not used in the pure C code.
The full source code above can be compiled and disassembled on an Arm machine using the following commands:
gcc -g -O3 matrix.c -o exe_matrix_o3
objdump -d exe_matrix_o3 > disasm_matrix_o3
If you don't have access to Arm-based hardware, you can use Arm DS-5 Community Edition and the Armv8-A Foundation Platform.
Program conventions
Macros
In order to use the intrinsics the Advanced SIMD architecture must be supported, and some specific instructions may or may not be enabled in any case. When the following macros are defined and equal to 1, the corresponding features are available:

__ARM_NEON
 Advanced SIMD is supported by the compiler
 Always 1 for AArch64

__ARM_NEON_FP
 Neon floating-point operations are supported
 Always 1 for AArch64

__ARM_FEATURE_CRYPTO
 Crypto instructions are available.
 Cryptographic Neon intrinsics are therefore available.

__ARM_FEATURE_FMA
 The fused multiply-accumulate instructions are available.
 Neon intrinsics which use these are therefore available.
This list is not exhaustive and further macros are detailed in the Arm C Language Extensions document.
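As a small sketch of how these macros might be used, the code below records at compile time which of three features the compiler reports, and exposes the result through a struct. The struct simd_features, the function query_simd_features, and the HAVE_ macros are all hypothetical names. The sketch treats a macro as enabled when it is defined and non-zero, which covers the value 1 described above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical convenience macros derived from the ACLE feature macros. */
#if defined(__ARM_NEON) && __ARM_NEON
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

#if defined(__ARM_FEATURE_CRYPTO) && __ARM_FEATURE_CRYPTO
#define HAVE_CRYPTO 1
#else
#define HAVE_CRYPTO 0
#endif

#if defined(__ARM_FEATURE_FMA) && __ARM_FEATURE_FMA
#define HAVE_FMA 1
#else
#define HAVE_FMA 0
#endif

/* Hypothetical summary of what the compiler was configured to target. */
struct simd_features {
    int neon;
    int crypto;
    int fma;
};

void query_simd_features(struct simd_features *out)
{
    assert(out != NULL);
    out->neon   = HAVE_NEON;
    out->crypto = HAVE_CRYPTO;
    out->fma    = HAVE_FMA;
}
```

The same guards can be placed around code that includes arm_neon.h, so that a scalar fallback is compiled when the feature is missing. Note that these macros describe the compilation target, not the processor the program eventually runs on.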
Types
There are three major categories of data type available in arm_neon.h which follow these patterns:
 baseW_t - Scalar data types
 baseWxL_t - Vector data types
 baseWxLxN_t - Vector array data types
Where:
 base refers to the fundamental data type.
 W is the width of the fundamental type.
 L is the number of scalar data type instances in a vector data type, for example an array of scalars.
 N is the number of vector data type instances in a vector array type, for example a struct of arrays of scalars.
Generally W and L are such that the vector data types are 64 or 128 bits long, and so fit completely into a Neon register. N corresponds with those instructions which operate on multiple registers at once.
In our earlier code we encountered an example of all three:
 uint8_t
 uint8x16_t
 uint8x16x3_t
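The sketch below uses all three kinds of type together, this time to go in the opposite direction to the earlier example: it packs 16 red, 16 green, and 16 blue bytes back into 48 interleaved bytes. The function name rgb_interleave_16 is hypothetical, and the scalar branch is an added fallback for builds without Neon.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: interleave 16 R, 16 G and 16 B bytes into 48 RGB bytes. */
void rgb_interleave_16(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                       uint8_t *rgb)
{
    assert(r != NULL && g != NULL && b != NULL && rgb != NULL);
#if defined(__ARM_NEON)
    uint8x16x3_t v;              /* baseWxLxN_t: a struct of three vectors */
    v.val[0] = vld1q_u8(r);      /* each member is a baseWxL_t, uint8x16_t */
    v.val[1] = vld1q_u8(g);
    v.val[2] = vld1q_u8(b);
    vst3q_u8(rgb, v);            /* stores R0 G0 B0 R1 G1 B1 ... */
#else
    for (int i = 0; i < 16; i++) {   /* baseW_t: plain uint8_t scalars */
        rgb[3 * i]     = r[i];
        rgb[3 * i + 1] = g[i];
        rgb[3 * i + 2] = b[i];
    }
#endif
}
```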
Functions
As per the Arm C Language Extensions, the function prototypes from arm_neon.h follow a common pattern. At the most general level this is:
ret v[p][q][r]name[u][n][q][x][_high][_lane | laneq][_n][_result]_type(args)
Be wary that some of the letters and names are overloaded, but in the order above:
 ret - the return type of the function.
 v - short for vector and is present on all the intrinsics.
 p - indicates a pairwise operation. ([value] means value may be present).
 q - indicates a saturating operation (with the exception of vqtb[l][x] in AArch64 operations where the q indicates 128-bit index and result operands).
 r - indicates a rounding operation.
 name - the descriptive name of the basic operation. Often this is an Advanced SIMD instruction, but it does not have to be.
 u - indicates signed-to-unsigned saturation.
 n - indicates a narrowing operation.
 q - postfixing the name indicates an operation on 128-bit vectors.
 x - indicates an Advanced SIMD scalar operation in AArch64. It can be one of b, h, s, or d (that is, 8, 16, 32, or 64 bits).
 _high - In AArch64, used for widening and narrowing operations involving 128-bit operands. For widening 128-bit operands, high refers to the top 64 bits of the source operand(s). For narrowing, it refers to the top 64 bits of the destination operand.
 _n - indicates a scalar operand supplied as an argument.
 _lane - indicates a scalar operand taken from the lane of a vector. _laneq indicates a scalar operand taken from the lane of an input vector of 128-bit width. (left | right means only left or right would appear).
 type - the primary operand type in short form.
 args - the function's arguments.
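As an example of reading these names, compare vaddq_u8 and vqaddq_u8. Both have the basic operation add, the trailing q for 128-bit vectors, and the type u8, but the second has a leading q and therefore saturates instead of wrapping. The sketch below wraps each in a small function so that the difference can be observed. The names add_wrap_16 and add_sat_16 are hypothetical, and the scalar branches are added fallbacks for builds without Neon.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* v + add + q, type u8: plain add on a 128-bit vector of uint8_t. */
void add_wrap_16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__ARM_NEON)
    vst1q_u8(out, vaddq_u8(vld1q_u8(a), vld1q_u8(b)));
#else
    for (int i = 0; i < 16; i++) {
        out[i] = (uint8_t)(a[i] + b[i]);              /* wraps modulo 256 */
    }
#endif
}

/* v + q + add + q, type u8: the leading q makes the add saturating. */
void add_sat_16(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
    assert(a != NULL && b != NULL && out != NULL);
#if defined(__ARM_NEON)
    vst1q_u8(out, vqaddq_u8(vld1q_u8(a), vld1q_u8(b)));
#else
    for (int i = 0; i < 16; i++) {
        unsigned sum = (unsigned)a[i] + b[i];
        out[i] = (uint8_t)(sum > 255u ? 255u : sum);  /* clamps at 255 */
    }
#endif
}
```

Adding 200 and 100 in every lane gives 44 with add_wrap_16, since 300 wraps modulo 256, and 255 with add_sat_16.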
Check your knowledge
Related information
Engineering specifications for the Neon intrinsics can be found in the Arm C Language Extensions (ACLE).
The Neon Intrinsics Reference provides a searchable reference of the functions specified by the ACLE.
The Architecture Exploration Tools let you investigate the Advanced SIMD instruction set.
The Arm Architecture Reference Manual provides a complete specification of the Advanced SIMD instruction set.