You copied the Doc URL to your clipboard.

Optimizing C Code with Neon Intrinsics

This guide shows you how to use Neon intrinsics in your C, or C++, code to take advantage of the Advanced SIMD technology in the Arm®v8 architecture. The simple example demonstrates how to use the intrinsics and provides an opportunity to explain their purpose.

At the end of the topic, there is a Quick reference section to summarize the following key concepts:

  • What is Neon and how can it be used?

  • What are the basics of using Neon intrinsics in the C language.

What is Neon?

Neon is the implementation of the Arm Advanced SIMD architecture.

The purpose of Neon is to accelerate data manipulation by providing:

  • 32 128-bit vector registers, each capable of containing multiple lanes of data.

  • SIMD instructions to operate simultaneously on those multiple lanes of data.

Applications that can benefit from Neon technology include multimedia and signal processing, 3D graphics, scientific simulations, image processing, or other applications where fixed and floating-point performance is critical.

As an application developer, there are a number of ways you can make use of Neon technology:

  • Neon-enabled open source libraries such as the Arm Compute Library or Ne10 provide one of the easiest ways to take advantage of Neon.

  • Auto-vectorization features in your compiler can automatically optimize your code to take advantage of Neon.

  • Neon intrinsics are function calls that the compiler replaces with appropriate Neon instructions. The intrinsics give you direct, low-level access to the exact Neon instructions you want, from C, or C++ code.

  • For very high performance, hand-coded Neon assembler can be the best approach for experienced developers.

In this guide the focus is on using the Neon intrinsics for AArch64.

Why intrinsics?

Intrinsics are functions whose precise implementation is known to a compiler. The Neon intrinsics are a set of C and C++ functions defined in arm_neon.h which are supported by the Arm compilers and GCC. These functions let you use Neon without having to write assembly code because the functions themselves contain short assembly kernels, which are inlined into the calling code. In addition, register allocation and pipeline optimization are handled by the compiler so many difficulties faced by the assembly developer are avoided.

Fr a list of all the Neon intrinsics, see the Neon Intrinsics Reference. The Neon intrinsics engineering specification is contained in the Arm C Language Extensions (ACLE).

Using Neon intrinsics has a number of benefits:

  • Powerful: Intrinsics give the developer direct access to the Neon instruction set, without the need for hand-written assembly code.

  • Portable: Hand-written Neon assembly instructions might need to be re-written for different target processors. C and C++ code containing Neon intrinsics can be compiled for a new target or a new execution state with minimal or no code changes.

  • Flexible: The developer can exploit Neon when needed, or use C/C++ when it is not, while avoiding many low-level engineering concerns.

However, intrinsics might not be the right choice in all situations:

  • There is a steeper learning curve to use Neon intrinsics than importing a library or relying on a compiler.

  • Hand-optimized assembly code might offer the greatest scope for performance improvement even if it is more difficult to write.

Example: Matrix multiplication

This example re-implements some C functions using Neon intrinsics. The example chosen does not reflect the full complexity of their application, but illustrates the use of intrinsics and is a starting point for more complex code.

Matrix multiplication is an operation performed in many data intensive applications. It is made up of groups of arithmetic operations which are repeated in a straightforward way:

Matrix multiplication diagram

The matrix multiplication process is as follows:

  1. Take a row in the first matrix - 'A'

  2. Perform a dot product of this row with a column from the second matrix - 'B'

  3. Store the result in the corresponding row and column of a new matrix - 'C'

For matrices of 32-bit floats, the multiplication could be written as:

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int j_idx=0; j_idx < m; j_idx++) {
            C[n*j_idx + i_idx] = 0;
            for (int k_idx=0; k_idx < k; k_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}

Assume a column-major layout of the matrices in memory. That is, an n x m matrix M, is represented as an array M_array, where Mij = M_array[n*j + i].

This code is sub-optimal, because it does not make full use of Neon. Intrinsics can be used to improve it.

In this example, we will examine small, fixed-size matrices before moving on to larger matrices.

The following code uses intrinsics to multiply two 4x4 matrices. Since there is a small, fixed number of values to process, all of which can fit into the Neon registers of the processor at once, the loops can be completely unrolled.

void matrix_multiply_4x4_neon(float32_t *A, float32_t *B, float32_t *C) {
    // these are the columns A
    float32x4_t A0;
    float32x4_t A1;
    float32x4_t A2;
    float32x4_t A3;

    // these are the columns B
    float32x4_t B0;
    float32x4_t B1;
    float32x4_t B2;
    float32x4_t B3;

    // these are the columns C
    float32x4_t C0;
    float32x4_t C1;
    float32x4_t C2;
    float32x4_t C3;

    A0 = vld1q_f32(A);
    A1 = vld1q_f32(A+4);
    A2 = vld1q_f32(A+8);
    A3 = vld1q_f32(A+12);

    // Zero accumulators for C values
    C0 = vmovq_n_f32(0);
    C1 = vmovq_n_f32(0);
    C2 = vmovq_n_f32(0);
    C3 = vmovq_n_f32(0);

    // Multiply accumulate in 4x1 blocks, i.e. each column in C
    B0 = vld1q_f32(B);
    C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
    C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
    C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
    C0 = vfmaq_laneq_f32(C0, A3, B0, 3);
    vst1q_f32(C, C0);

    B1 = vld1q_f32(B+4);
    C1 = vfmaq_laneq_f32(C1, A0, B1, 0);
    C1 = vfmaq_laneq_f32(C1, A1, B1, 1);
    C1 = vfmaq_laneq_f32(C1, A2, B1, 2);
    C1 = vfmaq_laneq_f32(C1, A3, B1, 3);
    vst1q_f32(C+4, C1);

    B2 = vld1q_f32(B+8);
    C2 = vfmaq_laneq_f32(C2, A0, B2, 0);
    C2 = vfmaq_laneq_f32(C2, A1, B2, 1);
    C2 = vfmaq_laneq_f32(C2, A2, B2, 2);
    C2 = vfmaq_laneq_f32(C2, A3, B2, 3);
    vst1q_f32(C+8, C2);

    B3 = vld1q_f32(B+12);
    C3 = vfmaq_laneq_f32(C3, A0, B3, 0);
    C3 = vfmaq_laneq_f32(C3, A1, B3, 1);
    C3 = vfmaq_laneq_f32(C3, A2, B3, 2);
    C3 = vfmaq_laneq_f32(C3, A3, B3, 3);
    vst1q_f32(C+12, C3);
}

Fixed-size 4x4 matrices are chosen because:

  • Some applications need 4x4 matrices specifically, for example: graphics or relativistic physics.

  • The Neon vector registers hold four 32-bit values. Matching the program to the architecture makes it easier to optimize.

  • This 4x4 kernel can be used in a more general kernel.

Summarizing the intrinsics that have been used here:

Code element

What is it?

Whay are we using it?

float32x4_t

An array of four 32-bit floats.

One uint32x4_t fits into a 128-bit register. We can ensure there are no wasted register bits even in C code.

vld1q_f32(...)

A function which loads four 32-bit floats into a float32x4_t.

To get the matrix values we need from A and B.

vfmaq_lane_f32(...)

A function which uses the fused multiply accumulate instruction. Multiplies a float32x4_t value by a single element of another float32x4_t then adds the result to a third float32x4_t before returning the result.

Since the matrix row-on-column dot products are a set of multiplications and additions, this operation fits quite naturally.

vst1q_f32(...)

A function which stores a float32x4_t at a given address.

To store the results after they are calculated.

To multiply larger matrices, treat them as blocks of 4x4 matrices. However, this approach only works with matrix sizes which are a multiple of four in both dimensions. To use use this method without changing it, pad the matrix with zeroes.

The code for a more general matrix multiplication is listed below. The structure of the kernel has changed very little, with the addition of loops and address calculations being the major changes. Like in the 4x4 kernel, unique variable names are used for the B columns. The alternative would be to use one variable and re-load it. This acts as a hint to the compiler to assign different registers to these variables. Assigning different registers enables the processor to complete the arithmetic instructions for one column, while waiting on the loads for another.

void matrix_multiply_neon(float32_t  *A, float32_t  *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    /*
     * Multiply matrices A and B, store the reult in C.
     * It is the users responsibility to make sure the matrices are compatible.
     */

    int A_idx;
    int B_idx;
    int C_idx;

    // these are the columns of a 4x4 sub matrix of A
    float32x4_t A0;
    float32x4_t A1;
    float32x4_t A2;
    float32x4_t A3;

    // these are the columns of a 4x4 sub matrix of B
    float32x4_t B0;
    float32x4_t B1;
    float32x4_t B2;
    float32x4_t B3;

    // these are the columns of a 4x4 sub matrix of C
    float32x4_t C0;
    float32x4_t C1;
    float32x4_t C2;
    float32x4_t C3;

    for (int i_idx=0; i_idx<n; i_idx+=4 {
            for (int j_idx=0; j_idx<m; j_idx+=4){
                 // zero accumulators before matrix op
                 c0=vmovq_n_f32(0);
                 c1=vmovq_n_f32(0);
                 c2=vmovq_n_f32(0);
                 c3=vmovq_n_f32(0);
                 for (int k_idx=0; k_idx<k; k_idx+=4){
                      // compute base index to 4x4 block
                      a_idx = i_idx + n*k_idx;
                      b_idx = k*j_idx k_idx;

                      // load most current a values in row
                      A0=vld1q_f32(A+A_idx);
                      A1=vld1q_f32(A+A_idx+n);
                      A2=vld1q_f32(A+A_idx+2*n);
                      A3=vld1q_f32(A+A_idx+3*n);

                      // multiply accumulate 4x1 blocks, i.e. each column C
                      B0=vld1q_f32(B+B_idx);
                      C0=vfmaq_laneq_f32(C0,A0,B0,0);
                      C0=vfmaq_laneq_f32(C0,A1,B0,1);
                      C0=vfmaq_laneq_f32(C0,A2,B0,2);
                      C0=vfmaq_laneq_f32(C0,A3,B0,3);

                      B1=v1d1q_f32(B+B_idx+k);
                      C1=vfmaq_laneq_f32(C1,A0,B1,0);
                      C1=vfmaq_laneq_f32(C1,A1,B1,1);
                      C1=vfmaq_laneq_f32(C1,A2,B1,2);
                      C1=vfmaq_laneq_f32(C1,A3,B1,3);

                      B2=vld1q_f32(B+B_idx+2*k);
                      C2=vfmaq_laneq_f32(C2,A0,B2,0);
                      C2=vfmaq_laneq_f32(C2,A1,B2,1);
                      C2=vfmaq_laneq_f32(C2,A2,B2,2);
                      C2=vfmaq_laneq_f32(C2,A3,B3,3);

                      B3=vld1q_f32(B+B_idx+3*k);
                      C3=vfmaq_laneq_f32(C3,A0,B3,0);
                      C3=vfmaq_laneq_f32(C3,A1,B3,1);
                      C3=vfmaq_laneq_f32(C3,A2,B3,2);
                      C3=vfmaq_laneq_f32(C3,A3,B3,3);
                }
   //Compute base index for stores
   C_idx = n*j_idx + i_idx;
   vstlq_f32(C+C_idx, C0);
   vstlq_f32(C+C_idx+n,Cl);
   vstlq_f32(C+C_idx+2*n,C2);
   vstlq_f32(C+C_idx+3*n,C3);
  }
 }
}

Compiling and disassembling this function, and comparing it with the C function shows:

  • Fewer arithmetic instructions for a given matrix multiplication, because we are leveraging the Advanced SIMD technology with full register packing. Typical C code, generally, does not do this.

  • FMLA instead of FMUL instructions. As specified by the intrinsics.

  • Fewer loop iterations. When used properly intrinsics allow loops to be unrolled easily.

However, there are unnecessary loads and stores due to memory allocation and initialization of data types (for example, float32x4_t) which are not used in the pure C code.

Program conventions

Macros

In order to use the intrinsics, the Advanced SIMD architecture must be supported. Some specific instructions might not be enabled. When the following macros are defined and equal to 1, the corresponding features are available:

  • __aarch64__

    • Selection of architecture dependent source at compile time.

    • Always 1 for AArch64.

  • _ARM_NEON

    • Advanced SIMD is supported by the compiler.

    • Always 1 for AArch64.

  • _ARM_NEON_FP

    • Neon floating-point operations are supported.

    • Always 1 for AArch64

  • _ARM_FEATURE_CRYPTO

    • Crypto instructions are available.

    • Cryptographic Neon intrinsics are therefore available.

  • _ARM_FEATURE_FMA

    • The fused multiply-accumulate instructions are available.

    • Neon intrinsics which use these are therefore available.

This list is not exhaustive and further macros are detailed in the Arm C Language Extensions document.

Types

There are three major categories of data type available in arm_neon.h which follow these patterns:

baseW_t

Scalar data types

baseWxL_t

Vector data types

baseWxLxN_t

Vector array data types

Where:

  • base refers to the fundamental data type.

  • W is the width of the fundamental type.

  • L is the number of scalar data type instances in a vector data type, for example an array of scalars.

  • N is the number of vector data type instances in a vector array type, for example a struct of arrays of scalars.

Generally, W and L are such that the vector data types are 64 or 128 bits long, and so fit completely into a Neon register. N corresponds with those instructions which operate on multiple registers at once.

Functions

As per the Arm C Language Extensions, the function prototypes from arm_neon.h follow a common pattern. At the most general level this is:

ret v[p][q][r]name[u][n][q][x][_high][_lane | laneq][_n][_result]_type(args)

Some of the letters and names are overloaded, but in the order above:

ret

The return type of the function.

v

Short for vector and is present on all the intrinsics.

p

Indicates a pairwise operation. ([value] means value may be present).

q

Indicates a saturating operation (with the exception of vqtb[l][x] in AArch64 operations where the q indicates 128-bit index and result operands).

r

Indicates a rounding operation.

name

The descriptive name of the basic operation. Often, this is an Advanced SIMD instruction, but it does not have to be.

u

Indicates signed-to-unsigned saturation.

n

Indicates a narrowing operation.

q

Postfixing the name indicates an operation on 128-bit vectors.

x

Indicates an Advanced SIMD scalar operation in AArch64. It can be one of b, h, s, or d (that is, 8, 16, 32, or 64 bits).

_high

In AArch64, used for widening and narrowing operations involving 128-bit operands. For widening 128-bit operands, high refers to the top 64-bits of the source operand (or operands). For narrowing, it refers to the top 64-bits of the destination operand.

_n

Indicates a scalar operand supplied as an argument.

_lane

Indicates a scalar operand taken from the lane of a vector. _laneq indicates a scalar operand taken from the lane of an input vector of 128-bit width. ( left | right means only left or right would appear).

type

The primary operand type in short form.

args

The arguments of the function.

Quick reference

What is Neon?

Neon is the implementation of the Advanced SIMD extension to the Arm architecture.

All processors compliant with the Arm®v8-A architecture (for example, the Cortex-A76 or Cortex-A57) include Neon. In the developer's view, Neon provides an additional 32 128-bit registers with instructions that operate on 8, 16, 32, or 64 bit lanes within these registers.

Which header file must you include in a C file in order to use the Neon intrinsics?

arm_neon.h

#include <arm_neon.h> must appear before the use of any Neon intrinsics.

What do the data types float64_t, poly64x2_t, and int8x8x3_t represent?

  • float64_t is a scalar type which is a 64-bit floating-point type.

  • poly64x2_t is a vector type of two 64-bit polynomial scalars.

  • int8x8x3_t is a vector array type of three vectors of eight 8-bit signed integers.

What does the int8x16_t vmulq_s8 (int8x16_t a, int8x16_t b) function do?

The mul in the function name indicates that this intrinsic uses the MUL instruction. The types of the arguments and return value (sixteen bytes of signed integers) inform you that this intrinsic maps to the following instruction:

MUL Vd.16B, Vn.16B, Vm.16B

This function multiplies corresponding elements of a and b, and returns the result.

The deinterleave function defined in this tutorial can only operate on blocks of sixteen 8 bit unsigned integers. If you had an array of uint8_t values that was not a multiple of sixteen in length, how might you account for this while: 1) Changing the arrays, but not the function? and 2) Changing the function, but not the arrays?

  1. Padding the arrays with zeros would be the simplest option, but padding might have to be accounted for in other functions.

  2. One method would be to use the Neon de-interleave for every whole multiple of sixteen values, and then use the C de-interleave for the remainder.

Was this page helpful? Yes No