Optimizing C/C++ code with Arm SIMD (Neon)

Describes how to optimize with Advanced SIMD (Neon) using Arm C/C++ Compiler.

The Arm SIMD (or Advanced SIMD) architecture, its associated implementations, and supporting software are commonly referred to as Neon technology. There are SIMD instruction sets for both AArch32 (equivalent to the Armv7 instructions) and AArch64. Both can significantly accelerate repetitive operations on the large data sets commonly encountered in High Performance Computing (HPC) applications.

Arm SIMD instructions perform "Packed SIMD" processing; the SIMD instructions pack multiple lanes of data into large registers, then perform the same operation across all data lanes.

For example, consider the following SIMD instruction:

ADD V0.2D, V1.2D, V2.2D

The instruction specifies that an addition (ADD) operation is performed on two 64-bit data lanes (2D). D specifies the width of each data lane (doubleword, or 64 bits) and 2 specifies that two lanes are used (that is, the full 128-bit register). Each lane in V1 is added to the corresponding lane in V2, and the result is stored in the corresponding lane of V0. Each lane is added separately: there are no carries between the lanes.
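The lane-wise behavior described above can be modeled in portable C. The following is a minimal sketch, not Neon code: the vec2d type and the add_2d function are hypothetical names introduced only for illustration, and unsigned 64-bit lanes are assumed so that overflow wraps in a well-defined way.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit register viewed as two 64-bit lanes (2D). */
typedef struct {
    uint64_t lane[2];
} vec2d;

/* Models ADD Vd.2D, Vn.2D, Vm.2D: each lane is added independently.
 * Unsigned 64-bit addition wraps modulo 2^64, so an overflow in one lane
 * never carries into the neighboring lane. */
vec2d add_2d(vec2d n, vec2d m)
{
    vec2d d;
    for (int i = 0; i < 2; ++i) {
        d.lane[i] = n.lane[i] + m.lane[i];
    }
    return d;
}
```

For example, adding lanes {UINT64_MAX, 5} to lanes {1, 7} with this model gives {0, 12}: lane 0 wraps around to zero, and lane 1 is unaffected by that overflow.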

Coding with SIMD

To take advantage of SIMD instructions in your code:

  • Let the compiler auto-vectorize your code for you.

    Arm C/C++ Compiler automatically vectorizes your code at higher optimization levels (-O2 and higher). The compiler identifies appropriate vectorization opportunities in your code and uses SIMD instructions where appropriate.

    At optimization level -O1, you can use the -fvectorize option to enable auto-vectorization.

    At the lowest optimization level, -O0, auto-vectorization is never performed, even if you specify -fvectorize.

  • Use intrinsics directly in your C or C++ code.

    Intrinsics are C or C++ pseudo-function calls that the compiler replaces with the appropriate SIMD instructions. Intrinsics let you use the data types and operations available in the SIMD implementation, while allowing the compiler to handle instruction scheduling and register allocation. The available intrinsics are defined in the Arm C Language Extensions (ACLE) document.

  • Write SIMD assembly code.

    Although it is technically possible to optimize SIMD assembly by hand, it can be very difficult because the pipeline and memory access timings have complex inter-dependencies. Instead of hand-writing assembly, Arm recommends the use of intrinsics.
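To make the first two approaches concrete, the following sketch shows the same array addition written twice. The function names add_arrays_scalar and add_arrays_neon are hypothetical and chosen only for illustration. The first version is a plain loop of the kind a compiler may auto-vectorize at -O2 or higher, although whether it actually does so depends on the compiler and target. The second version uses the Neon intrinsics vld1q_u64, vaddq_u64, and vst1q_u64 from arm_neon.h, and is guarded by the __ARM_NEON macro so that the file still builds, using the scalar loop only, on targets without Neon.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Approach 1: a plain C loop that is a candidate for auto-vectorization.
 * 'restrict' tells the compiler that the arrays do not overlap, which
 * removes one common obstacle to vectorization. */
void add_arrays_scalar(uint64_t *restrict dst, const uint64_t *restrict a,
                       const uint64_t *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}

/* Approach 2: Neon intrinsics, processing two 64-bit lanes per iteration. */
void add_arrays_neon(uint64_t *dst, const uint64_t *a,
                     const uint64_t *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 2 <= n; i += 2) {
        uint64x2_t va = vld1q_u64(a + i);      /* load two lanes from a */
        uint64x2_t vb = vld1q_u64(b + i);      /* load two lanes from b */
        vst1q_u64(dst + i, vaddq_u64(va, vb)); /* lane-wise add, then store */
    }
#endif
    /* Scalar tail: handles a leftover element when n is odd, or every
     * element when Neon is not available. */
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

Both functions should produce identical results for any input. The intrinsics version fixes the vector width and the handling of the leftover elements explicitly, whereas the plain loop leaves those decisions to the compiler.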
