Optimizing C Code with Neon Intrinsics
This guide shows you how to use Neon intrinsics in your C or C++ code to take advantage of the Advanced SIMD technology in the Arm®v8 architecture. A simple example demonstrates how to use the intrinsics and provides an opportunity to explain their purpose.
At the end of the topic, there is a Quick reference section that summarizes the following key concepts:
What is Neon, and how can it be used?
The basics of using Neon intrinsics in the C language.
What is Neon?
Neon is the implementation of the Arm Advanced SIMD architecture.
The purpose of Neon is to accelerate data manipulation by providing:
32 128-bit vector registers, each capable of containing multiple lanes of data.
SIMD instructions to operate simultaneously on those multiple lanes of data.
Applications that can benefit from Neon technology include multimedia and signal processing, 3D graphics, scientific simulation, image processing, and other applications where fixed-point and floating-point performance is critical.
As an application developer, there are a number of ways you can make use of Neon technology:
Neon-enabled open source libraries, such as the Arm Compute Library or Ne10, provide one of the easiest ways to take advantage of Neon.
Auto-vectorization features in your compiler can automatically optimize your code to take advantage of Neon.
Neon intrinsics are function calls that the compiler replaces with appropriate Neon instructions. The intrinsics give you direct, low-level access to the exact Neon instructions you want, from C or C++ code.
For very high performance, hand-coded Neon assembly can be the best approach for experienced developers.
In this guide the focus is on using the Neon intrinsics for AArch64.
Why intrinsics?
Intrinsics are functions whose precise implementation is known to the compiler. The Neon intrinsics are a set of C and C++ functions, defined in arm_neon.h, which are supported by the Arm compilers and GCC. These functions let you use Neon without having to write assembly code, because the functions themselves contain short assembly kernels which are inlined into the calling code. In addition, register allocation and pipeline optimization are handled by the compiler, so many difficulties faced by the assembly developer are avoided.
For a list of all the Neon intrinsics, see the Neon Intrinsics Reference. The Neon intrinsics engineering specification is contained in the Arm C Language Extensions (ACLE).
Using Neon intrinsics has a number of benefits:
Powerful: Intrinsics give the developer direct access to the Neon instruction set without the need for hand-written assembly code.
Portable: Hand-written Neon assembly instructions might need to be rewritten for different target processors. C and C++ code containing Neon intrinsics can be compiled for a new target or a new execution state with minimal or no code changes.
Flexible: The developer can exploit Neon when needed, or use C/C++ when it is not, while avoiding many low-level engineering concerns.
However, intrinsics might not be the right choice in all situations:
Using Neon intrinsics involves a steeper learning curve than importing a library or relying on the compiler.
Handoptimized assembly code might offer the greatest scope for performance improvement even if it is more difficult to write.
Example: Matrix multiplication
This example reimplements some C functions using Neon intrinsics. The example chosen does not reflect the full complexity of a real application, but it illustrates the use of intrinsics and is a starting point for more complex code.
Matrix multiplication is an operation performed in many data intensive applications. It is made up of groups of arithmetic operations which are repeated in a straightforward way:
The matrix multiplication process is as follows:
Take a row from the first matrix, 'A'.
Perform a dot product of this row with a column from the second matrix, 'B'.
Store the result in the corresponding row and column of a new matrix, 'C'.
For matrices of 32-bit floats, the multiplication could be written as:
```c
void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx = 0; i_idx < n; i_idx++) {
        for (int j_idx = 0; j_idx < m; j_idx++) {
            C[n * j_idx + i_idx] = 0;
            for (int k_idx = 0; k_idx < k; k_idx++) {
                C[n * j_idx + i_idx] += A[n * k_idx + i_idx] * B[k * j_idx + k_idx];
            }
        }
    }
}
```
Assume a column-major layout of the matrices in memory. That is, an n x m matrix M is represented as an array M_array, where Mij = M_array[n*j + i].
This code is suboptimal, because it does not make full use of Neon. Intrinsics can be used to improve it.
In this example, we examine small, fixed-size matrices before moving on to larger matrices.
The following code uses intrinsics to multiply two 4x4 matrices. Since there is a small, fixed number of values to process, all of which can fit into the Neon registers of the processor at once, the loops can be completely unrolled.
```c
void matrix_multiply_4x4_neon(float32_t *A, float32_t *B, float32_t *C) {
    // these are the columns of A
    float32x4_t A0;
    float32x4_t A1;
    float32x4_t A2;
    float32x4_t A3;
    // these are the columns of B
    float32x4_t B0;
    float32x4_t B1;
    float32x4_t B2;
    float32x4_t B3;
    // these are the columns of C
    float32x4_t C0;
    float32x4_t C1;
    float32x4_t C2;
    float32x4_t C3;

    A0 = vld1q_f32(A);
    A1 = vld1q_f32(A + 4);
    A2 = vld1q_f32(A + 8);
    A3 = vld1q_f32(A + 12);

    // zero accumulators for the C values
    C0 = vmovq_n_f32(0);
    C1 = vmovq_n_f32(0);
    C2 = vmovq_n_f32(0);
    C3 = vmovq_n_f32(0);

    // multiply-accumulate in 4x1 blocks, i.e. each column in C
    B0 = vld1q_f32(B);
    C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
    C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
    C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
    C0 = vfmaq_laneq_f32(C0, A3, B0, 3);
    vst1q_f32(C, C0);

    B1 = vld1q_f32(B + 4);
    C1 = vfmaq_laneq_f32(C1, A0, B1, 0);
    C1 = vfmaq_laneq_f32(C1, A1, B1, 1);
    C1 = vfmaq_laneq_f32(C1, A2, B1, 2);
    C1 = vfmaq_laneq_f32(C1, A3, B1, 3);
    vst1q_f32(C + 4, C1);

    B2 = vld1q_f32(B + 8);
    C2 = vfmaq_laneq_f32(C2, A0, B2, 0);
    C2 = vfmaq_laneq_f32(C2, A1, B2, 1);
    C2 = vfmaq_laneq_f32(C2, A2, B2, 2);
    C2 = vfmaq_laneq_f32(C2, A3, B2, 3);
    vst1q_f32(C + 8, C2);

    B3 = vld1q_f32(B + 12);
    C3 = vfmaq_laneq_f32(C3, A0, B3, 0);
    C3 = vfmaq_laneq_f32(C3, A1, B3, 1);
    C3 = vfmaq_laneq_f32(C3, A2, B3, 2);
    C3 = vfmaq_laneq_f32(C3, A3, B3, 3);
    vst1q_f32(C + 12, C3);
}
```
Fixed-size 4x4 matrices are chosen because:
Some applications need 4x4 matrices specifically, for example, graphics or relativistic physics.
The Neon vector registers hold four 32-bit values, so matching the program to the architecture makes it easier to optimize.
This 4x4 kernel can be used in a more general kernel.
Summarizing the intrinsics that have been used here:

| Code element | What is it? | Why are we using it? |
|---|---|---|
| `float32x4_t` | An array of four 32-bit floats. | One column of a 4x4 matrix of floats, held in a single register. |
| `vld1q_f32` | A function which loads four 32-bit floats into a `float32x4_t`. | To get the matrix values we need from A and B. |
| `vfmaq_laneq_f32` | A function which uses the fused multiply-accumulate instruction. Multiplies a `float32x4_t` value by a single element of another `float32x4_t`, then adds the result to a third `float32x4_t` before returning the result. | Since the matrix row-on-column dot products are a set of multiplications and additions, this operation fits quite naturally. |
| `vst1q_f32` | A function which stores a `float32x4_t` to memory. | To store the results after they are calculated. |
To multiply larger matrices, treat them as blocks of 4x4 matrices. However, this approach only works with matrix sizes that are a multiple of four in both dimensions. To use this method without changing it, pad the matrix with zeros.
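One way to do the padding is sketched below in plain C. The helper name `pad_to_multiple_of_4` is illustrative, and the code assumes the column-major layout used throughout this guide:

```c
#include <stdlib.h>
#include <string.h>

/* Copy an n x m column-major matrix into a zero-initialized buffer whose
 * dimensions are rounded up to the next multiple of four.
 * Returns the padded buffer (caller frees); *pn and *pm receive the
 * padded dimensions. */
static float *pad_to_multiple_of_4(const float *M, int n, int m,
                                   int *pn, int *pm) {
    *pn = (n + 3) & ~3; /* round up to a multiple of 4 */
    *pm = (m + 3) & ~3;
    float *P = calloc((size_t)(*pn) * (size_t)(*pm), sizeof(float));
    if (!P) return NULL;
    /* copy each source column; the rest stays zero */
    for (int j = 0; j < m; j++) {
        memcpy(P + (size_t)(*pn) * j, M + (size_t)n * j, n * sizeof(float));
    }
    return P;
}
```

Because the extra rows and columns are zero, they contribute nothing to the dot products, so the padded result contains the true result in its top-left n x m block.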
The code for a more general matrix multiplication is listed below. The structure of the kernel has changed very little; the major changes are the addition of loops and address calculations. As in the 4x4 kernel, unique variable names are used for the B columns. The alternative would be to use one variable and reload it. Using separate names acts as a hint to the compiler to assign different registers to these variables, which enables the processor to complete the arithmetic instructions for one column while waiting on the loads for another.
```c
void matrix_multiply_neon(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    /*
     * Multiply matrices A and B, store the result in C.
     * It is the user's responsibility to make sure the matrices are compatible.
     */
    int A_idx;
    int B_idx;
    int C_idx;
    // these are the columns of a 4x4 sub matrix of A
    float32x4_t A0;
    float32x4_t A1;
    float32x4_t A2;
    float32x4_t A3;
    // these are the columns of a 4x4 sub matrix of B
    float32x4_t B0;
    float32x4_t B1;
    float32x4_t B2;
    float32x4_t B3;
    // these are the columns of a 4x4 sub matrix of C
    float32x4_t C0;
    float32x4_t C1;
    float32x4_t C2;
    float32x4_t C3;
    for (int i_idx = 0; i_idx < n; i_idx += 4) {
        for (int j_idx = 0; j_idx < m; j_idx += 4) {
            // zero accumulators before matrix op
            C0 = vmovq_n_f32(0);
            C1 = vmovq_n_f32(0);
            C2 = vmovq_n_f32(0);
            C3 = vmovq_n_f32(0);
            for (int k_idx = 0; k_idx < k; k_idx += 4) {
                // compute base index to 4x4 block
                A_idx = i_idx + n * k_idx;
                B_idx = k * j_idx + k_idx;
                // load the most current A values in row
                A0 = vld1q_f32(A + A_idx);
                A1 = vld1q_f32(A + A_idx + n);
                A2 = vld1q_f32(A + A_idx + 2 * n);
                A3 = vld1q_f32(A + A_idx + 3 * n);
                // multiply accumulate 4x1 blocks, i.e. each column of C
                B0 = vld1q_f32(B + B_idx);
                C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
                C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
                C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
                C0 = vfmaq_laneq_f32(C0, A3, B0, 3);
                B1 = vld1q_f32(B + B_idx + k);
                C1 = vfmaq_laneq_f32(C1, A0, B1, 0);
                C1 = vfmaq_laneq_f32(C1, A1, B1, 1);
                C1 = vfmaq_laneq_f32(C1, A2, B1, 2);
                C1 = vfmaq_laneq_f32(C1, A3, B1, 3);
                B2 = vld1q_f32(B + B_idx + 2 * k);
                C2 = vfmaq_laneq_f32(C2, A0, B2, 0);
                C2 = vfmaq_laneq_f32(C2, A1, B2, 1);
                C2 = vfmaq_laneq_f32(C2, A2, B2, 2);
                C2 = vfmaq_laneq_f32(C2, A3, B2, 3);
                B3 = vld1q_f32(B + B_idx + 3 * k);
                C3 = vfmaq_laneq_f32(C3, A0, B3, 0);
                C3 = vfmaq_laneq_f32(C3, A1, B3, 1);
                C3 = vfmaq_laneq_f32(C3, A2, B3, 2);
                C3 = vfmaq_laneq_f32(C3, A3, B3, 3);
            }
            // compute base index for stores
            C_idx = n * j_idx + i_idx;
            vst1q_f32(C + C_idx, C0);
            vst1q_f32(C + C_idx + n, C1);
            vst1q_f32(C + C_idx + 2 * n, C2);
            vst1q_f32(C + C_idx + 3 * n, C3);
        }
    }
}
```
Compiling and disassembling this function, and comparing it with the C function shows:
Fewer arithmetic instructions for a given matrix multiplication, because the Advanced SIMD technology is used with full register packing. Typical C code generally does not do this.
`FMLA` instead of `FMUL` instructions, as specified by the intrinsics.
Fewer loop iterations. When used properly, intrinsics allow loops to be unrolled easily.
However, there are unnecessary loads and stores due to memory allocation and initialization of data types (for example, `float32x4_t`) which are not used in the pure C code.
Program conventions
Macros
In order to use the intrinsics, the Advanced SIMD architecture must be supported. Some specific instructions might not be enabled. When the following macros are defined and equal to 1, the corresponding features are available:
__aarch64__
Selection of architecture-dependent source at compile time.
Always 1 for AArch64.
__ARM_NEON
Advanced SIMD is supported by the compiler.
Always 1 for AArch64.
__ARM_NEON_FP
Neon floating-point operations are supported.
Always 1 for AArch64.
__ARM_FEATURE_CRYPTO
Crypto instructions are available.
Cryptographic Neon intrinsics are therefore available.
__ARM_FEATURE_FMA
The fused multiply-accumulate instructions are available.
Neon intrinsics which use these are therefore available.
This list is not exhaustive and further macros are detailed in the Arm C Language Extensions document.
Types
There are three major categories of data type available in arm_neon.h, which follow these patterns:
baseW_t
Scalar data types
baseWxL_t
Vector data types
baseWxLxN_t
Vector array data types
Where:
base refers to the fundamental data type.
W is the width of the fundamental type.
L is the number of scalar data type instances in a vector data type, for example an array of scalars.
N is the number of vector data type instances in a vector array type, for example a struct of arrays of scalars.
Generally, W and L are such that the vector data types are 64 or 128 bits long, and so fit completely into a Neon register. N corresponds with those instructions which operate on multiple registers at once.
Functions
As per the Arm C Language Extensions, the function prototypes from arm_neon.h
follow a common pattern. At the most general level this is:
ret v[p][q][r]name[u][n][q][x][_high][_lane|_laneq][_n][_result]_type(args)
Some of the letters and names are overloaded, but in the order above:
ret
The return type of the function.
v
Short for vector; present on all the intrinsics.
p
Indicates a pairwise operation. ([value] means value may be present.)
q
Indicates a saturating operation (with the exception of vqtb[l][x] in AArch64 operations, where the q indicates 128-bit index and result operands).
r
Indicates a rounding operation.
name
The descriptive name of the basic operation. Often, this is an Advanced SIMD instruction, but it does not have to be.
u
Indicates signed-to-unsigned saturation.
n
Indicates a narrowing operation.
q
Postfixing the name indicates an operation on 128-bit vectors.
x
Indicates an Advanced SIMD scalar operation in AArch64. It can be one of b, h, s, or d (that is, 8, 16, 32, or 64 bits).
_high
In AArch64, used for widening and narrowing operations involving 128-bit operands. For widening 128-bit operands, high refers to the top 64 bits of the source operand (or operands). For narrowing, it refers to the top 64 bits of the destination operand.
_n
Indicates a scalar operand supplied as an argument.
_lane
Indicates a scalar operand taken from the lane of a vector.
_laneq
Indicates a scalar operand taken from the lane of an input vector of 128-bit width. (left|right means only left or right would appear.)
type
The primary operand type in short form.
args
The arguments of the function.
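Applying the pattern to a few intrinsic names makes it more concrete. Each decomposition below follows the scheme above (shown as annotations, not executable code):

```c
/* vadd_s8       : v + add + _s8        -> add two 64-bit vectors of signed bytes.
 * vaddq_s8      : v + add + q + _s8    -> add two 128-bit vectors (q = 128-bit).
 * vqaddq_s8     : v + q(saturating) + add + q(128-bit) + _s8
 *                 -> saturating add of two 128-bit vectors of signed bytes.
 * vfmaq_laneq_f32: v + fma + q(128-bit) + _laneq + _f32
 *                 -> fused multiply-accumulate by one lane of a 128-bit vector,
 *                    as used in the matrix kernels above.
 */
```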
Quick reference
What is Neon?
Neon is the implementation of the Advanced SIMD extension to the Arm architecture.
All processors compliant with the Arm®v8-A architecture (for example, the Cortex-A76 or Cortex-A57) include Neon. In the developer's view, Neon provides an additional 32 128-bit registers, with instructions that operate on 8-, 16-, 32-, or 64-bit lanes within these registers.
Which header file must you include in a C file in order to use the Neon intrinsics?
arm_neon.h
#include <arm_neon.h>
must appear before the use of any Neon intrinsics.
What do the data types float64_t, poly64x2_t, and int8x8x3_t represent?
float64_t is a scalar type: a 64-bit floating-point type.
poly64x2_t is a vector type of two 64-bit polynomial scalars.
int8x8x3_t is a vector array type of three vectors of eight 8-bit signed integers.
What does the int8x16_t vmulq_s8 (int8x16_t a, int8x16_t b) function do?
The mul
in the function name indicates that this intrinsic uses the MUL
instruction. The types of the arguments and return value (sixteen bytes of signed integers) inform you that this intrinsic maps to the following instruction:
MUL Vd.16B, Vn.16B, Vm.16B
This function multiplies corresponding elements of a and b, and returns the result.
The deinterleave function defined in this tutorial can only operate on blocks of sixteen 8-bit unsigned integers. If you had an array of uint8_t values whose length was not a multiple of sixteen, how might you account for this while: 1) changing the arrays, but not the function? and 2) changing the function, but not the arrays?
Padding the arrays with zeros would be the simplest option, but padding might have to be accounted for in other functions.
One method would be to use the Neon deinterleave for every whole multiple of sixteen values, and then use the C deinterleave for the remainder.
Related information
Engineering specifications for the Neon intrinsics can be found in the Arm C Language Extensions (ACLE).
The Neon Intrinsics Reference provides a searchable reference of the functions specified by the ACLE.
The Architecture Exploration Tools let you investigate the Advanced SIMD instruction set.
The Arm Architecture Reference Manual provides a complete specification of the Advanced SIMD instruction set.