Introducing Neon for Arm®v8-A
This guide introduces Arm Neon technology, the Advanced SIMD (Single Instruction Multiple Data) architecture extension for implementations of the Armv8-A architecture profile.
Neon technology is a dedicated extension to the Instruction Set Architecture. It adds instructions that can perform mathematical operations in parallel on multiple data streams.
Neon can be used to accelerate the core algorithms used in many compute-intensive applications, and is commonly used by core maths libraries. Neon can also accelerate signal processing algorithms and functions to speed up applications such as audio and video processing, voice and facial recognition, computer vision, and deep learning.
As an application developer, there are a number of ways you can make use of Neon technology:
Auto-vectorization features in your compiler can automatically optimize your code to take advantage of Neon.
Neon intrinsics are function calls that the compiler replaces with appropriate Neon instructions. This gives you direct, low-level access to the exact Neon instructions you want, from C/C++ code.
Hand-coded Neon assembler can be an alternative approach for experienced programmers.
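As a rough illustration of the first two approaches, the sketch below writes the same array addition twice: once as a plain C loop that a compiler may auto-vectorize when optimization is enabled, and once with Neon intrinsics from arm_neon.h. The function names add_arrays_plain and add_arrays_intrinsics are hypothetical and chosen only for this example. The intrinsics path is guarded by the __ARM_NEON macro, so the code falls back to scalar C on targets without Neon.

```c
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Plain C loop. With optimization enabled, a compiler may auto-vectorize
 * this into Neon instructions. `restrict` promises that the arrays do not
 * overlap, which makes vectorization easier for the compiler to prove safe. */
void add_arrays_plain(float *restrict dst, const float *restrict a,
                      const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}

/* The same operation with intrinsics: four floats per iteration when Neon
 * is available, followed by a scalar loop for any leftover elements. */
void add_arrays_intrinsics(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);      /* load 4 floats from b */
        vst1q_f32(dst + i, vaddq_f32(va, vb));  /* add lane by lane, store */
    }
#endif
    for (; i < n; i++) {                        /* scalar tail or fallback */
        dst[i] = a[i] + b[i];
    }
}
```

Both functions should produce the same results. Which one is faster depends on the compiler, its options, and the target processor, so it is worth measuring before committing to intrinsics.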
Before you begin
If you are completely new to Arm technology, you can read the Cortex-A Series Programmer's Guide for general information about the Arm architecture and programming guidelines.
The information in this guide relates to Neon for Arm®v8.
If you are hand-coding in assembler for a specific device, refer to the Technical Reference Manual (TRM) for that processor to see the microarchitectural details that can help you maximize performance. For some processors, Arm also publishes a Software Optimization Guide which might be of use. For example, see the Arm Cortex-A75 Technical Reference Manual and the Arm Cortex-A75 Software Optimization Guide.
Data processing methodologies
When processing large sets of data, a major performance-limiting factor is the amount of CPU time taken to perform data processing instructions. This CPU time depends on the number of instructions it takes to deal with the entire data set, and the number of instructions depends on how many items of data each instruction can process.
Single Instruction Single Data (SISD)
Most Arm instructions are Single Instruction Single Data (SISD). Each instruction performs its specified operation on a single datum. Processing multiple items requires multiple instructions. For example, performing four addition operations requires four instructions to add values from four pairs of registers:
ADD x0, x0, x5
ADD x1, x1, x6
ADD x2, x2, x7
ADD x3, x3, x8
This method is relatively slow and it can be difficult to see how different registers are related. To improve performance and efficiency, media processing is often off-loaded to dedicated processors such as a Graphics Processing Unit (GPU) or Media Processing Unit which can process more than one data value with a single instruction.
If the values you are dealing with are smaller than the maximum bit size, that extra potential bandwidth is wasted with SISD instructions. For example, when adding 8-bit values together, each 8-bit value needs to be loaded into a separate 64-bit register. Performing large numbers of individual operations on small data sizes does not use machine resources efficiently because processor, registers, and data path are all designed for 64-bit calculations.
Single Instruction Multiple Data (SIMD)
Single Instruction Multiple Data (SIMD) instructions perform the same operation simultaneously on multiple items. These items are packed as separate lanes in a larger register. For example, the following instruction adds four pairs of 32-bit integer values together. In this case, the values are packed as separate lanes in two 128-bit source registers. Each lane in the first source register is added to the corresponding lane in the second source register, and the result is stored in the destination register:
ADD V10.4S, V8.4S, V9.4S
This operation adds the two 128-bit (quadword) registers V8 and V9 and stores the result in V10. Each of the four 32-bit lanes in each register is added separately. There are no carries between the lanes.
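The lane-wise addition described above can be sketched from C with the vaddq_u32 intrinsic, which compilers typically lower to a four-lane 32-bit vector ADD on AArch64. The helper name add_4x32 is hypothetical, and the scalar branch is only there so that the sketch also builds on targets without Neon.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Adds four 32-bit lanes of a and b. Each lane wraps around on its own;
 * nothing is carried into the neighbouring lane. */
void add_4x32(uint32_t out[4], const uint32_t a[4], const uint32_t b[4])
{
#if defined(__ARM_NEON)
    uint32x4_t va = vld1q_u32(a);         /* four lanes from a */
    uint32x4_t vb = vld1q_u32(b);         /* four lanes from b */
    vst1q_u32(out, vaddq_u32(va, vb));    /* one lane-wise vector add */
#else
    for (int lane = 0; lane < 4; lane++) {
        out[lane] = a[lane] + b[lane];    /* unsigned add wraps modulo 2^32 */
    }
#endif
}
```

For example, adding {0xFFFFFFFF, 1, 2, 3} and {1, 1, 1, 1} gives {0, 2, 3, 4}. Lane 0 wraps to zero, and lane 1 is unaffected because no carry crosses the lane boundary.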
This single instruction operates on all data values in the large register at the same time.
Performing the four operations with a single SIMD instruction is faster than with four separate SISD instructions. This example uses 128-bit registers each holding four 32-bit values, but other combinations are possible for Neon registers:
Four 32-bit, eight 16-bit, or sixteen 8-bit integer data elements can be operated on simultaneously in a single 128-bit register.
Two 32-bit, four 16-bit, or eight 8-bit integer data elements can be operated on simultaneously in a single 64-bit register.
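The combinations listed above all follow from dividing the register width by the element width. A minimal sketch of that arithmetic, using a hypothetical helper named lane_count:

```c
/* Number of lanes that fit in a register: register width divided by
 * element width, both given in bits. */
unsigned lane_count(unsigned register_bits, unsigned element_bits)
{
    return register_bits / element_bits;
}
```

For instance, lane_count(128, 16) is 8 and lane_count(64, 8) is also 8, matching the eight 16-bit and eight 8-bit cases in the list.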
Media processors, such as those used in mobile devices, often split each full data register into multiple sub-registers and perform computations on the sub-registers in parallel. If the processing for the data sets is simple and repeated many times, SIMD can give considerable performance improvements. SIMD is also beneficial for:
Audio, video, and image processing codecs.
2D graphics based on rectangular blocks of pixels.
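As one possible illustration of simple, repeated processing on pixel data, the sketch below brightens a buffer of 8-bit pixel values sixteen at a time. The function name brighten and the choice of a saturating add (the vqaddq_u8 intrinsic) are assumptions made for this example and are not taken from the guide. The scalar loop handles leftover pixels and also serves as the fallback on targets without Neon.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Adds `amount` to every 8-bit pixel, clamping the result at 255. */
void brighten(uint8_t *pixels, size_t n, uint8_t amount)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    const uint8x16_t vamount = vdupq_n_u8(amount);    /* amount in all 16 lanes */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t v = vld1q_u8(pixels + i);          /* load 16 pixels */
        vst1q_u8(pixels + i, vqaddq_u8(v, vamount));  /* saturating add, store */
    }
#endif
    for (; i < n; i++) {                              /* scalar tail or fallback */
        unsigned sum = (unsigned)pixels[i] + amount;
        pixels[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
}
```

The same load, operate, store pattern applies to many per-pixel and per-sample operations. Only the intrinsic in the middle changes.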
Fundamentals of Arm®v8 Neon technology
Arm®v8 includes both 32-bit execution and 64-bit execution states, each with their own instruction sets:
AArch64 is the name used to describe the 64-bit execution state of the Arm®v8 architecture.
In AArch64 state, the processor executes the A64 instruction set, which contains Neon instructions (also referred to as SIMD instructions).
GNU and Linux documentation sometimes refers to AArch64 as ARM64.
AArch32 describes the 32-bit execution state of the Arm®v8 architecture, which is almost identical to Arm®v7.
In AArch32 state, the processor can execute either the A32 (called ARM in earlier versions of the architecture) or the T32 (Thumb) instruction set. The A32 and T32 instruction sets are backwards compatible with Arm®v7, including Neon instructions.
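In C code, the execution state being targeted can be checked at compile time. The sketch below assumes a GCC-compatible or Clang-compatible compiler that defines the predefined macros __aarch64__, __arm__, and __ARM_NEON. Other toolchains may use different macro names. The function names are hypothetical.

```c
/* Returns 64 when compiling for AArch64, 32 when compiling for AArch32,
 * and 0 when the target is not an Arm processor. */
int execution_state_bits(void)
{
#if defined(__aarch64__)
    return 64;
#elif defined(__arm__)
    return 32;
#else
    return 0;
#endif
}

/* Returns 1 when the compiler is allowed to generate Neon instructions
 * for the current target, and 0 otherwise. */
int neon_available(void)
{
#if defined(__ARM_NEON)
    return 1;
#else
    return 0;
#endif
}
```

These checks describe the compilation target, not the processor that the program eventually runs on.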
Registers, vectors, lanes, and elements
The Neon unit operates on a separate register file of 128-bit registers. The Neon unit is fully integrated into the processor and shares the processor resources for integer operation, loop control, and caching. This significantly reduces the area and power cost compared to a hardware accelerator. It also uses a much simpler programming model, since the Neon unit uses the same address space as the application.
The Neon register file is a collection of registers which can be accessed as 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit registers.
The Neon registers contain vectors of elements of the same data type. A vector is divided into lanes and each lane contains a data value called an element.
Usually each Neon instruction results in n operations occurring in parallel, where n is the number of lanes that the input vectors are divided into. Each operation is contained within the lane. There cannot be a carry or overflow from one lane to another. The number of lanes in a Neon vector depends on the size of the vector and the data elements in the vector. A 128-bit Neon vector can contain the following element sizes:
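Sixteen 8-bit elements (operand suffix .16B, where B indicates byte)
Eight 16-bit elements (operand suffix .8H, where H indicates halfword)
Four 32-bit elements (operand suffix .4S, where S indicates word)
Two 64-bit elements (operand suffix .2D, where D indicates doubleword)
A 64-bit Neon vector can contain the following element sizes:
Eight 8-bit elements (operand suffix .8B, where B indicates byte)
Four 16-bit elements (operand suffix .4H, where H indicates halfword)
Two 32-bit elements (operand suffix .2S, where S indicates word)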
If you want the equivalent of 1D, use
Elements in a vector are ordered from the least significant bit. That is, element 0 uses the least significant bits of the register.
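The element ordering can be made concrete by viewing a 64-bit vector of four 16-bit elements as a single 64-bit value. This is a sketch that assumes a little-endian target. The helper name pack_4x16 is hypothetical, and the scalar branch spells out the same layout by hand for targets without Neon.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Returns the 64-bit register image of a vector whose elements are
 * e[0]..e[3], with element 0 in the least significant 16 bits. */
uint64_t pack_4x16(const uint16_t e[4])
{
#if defined(__ARM_NEON)
    uint16x4_t v = vld1_u16(e);                        /* e[0] becomes element 0 */
    return vget_lane_u64(vreinterpret_u64_u16(v), 0);  /* view as one 64-bit lane */
#else
    uint64_t image = 0;
    for (int i = 0; i < 4; i++) {
        image |= (uint64_t)e[i] << (16 * i);           /* element i at bits 16*i */
    }
    return image;
#endif
}
```

With elements {0x1111, 0x2222, 0x3333, 0x4444}, the result is 0x4444333322221111, so element 0 occupies the lowest 16 bits.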
Looking at an example of a Neon instruction, the instruction ADD V0.8H, V1.8H, V2.8H performs a parallel addition of eight lanes of 16-bit (8 x 16 = 128) integer elements from vectors in V1 and V2, storing the result in V0.
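From C, the closest equivalent to ADD V0.8H, V1.8H, V2.8H is the vaddq_s16 intrinsic, or vaddq_u16 for unsigned data. The sketch below names its variables after the registers in the instruction for readability only. The compiler is free to allocate different registers, and the helper name add_8x16 is hypothetical.

```c
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Adds eight 16-bit lanes of a and b and writes the eight sums to out. */
void add_8x16(int16_t out[8], const int16_t a[8], const int16_t b[8])
{
#if defined(__ARM_NEON)
    int16x8_t v1 = vld1q_s16(a);        /* first source vector */
    int16x8_t v2 = vld1q_s16(b);        /* second source vector */
    int16x8_t v0 = vaddq_s16(v1, v2);   /* eight 16-bit additions at once */
    vst1q_s16(out, v0);
#else
    for (int lane = 0; lane < 8; lane++) {
        out[lane] = (int16_t)(a[lane] + b[lane]);  /* scalar fallback */
    }
#endif
}
```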
The Optimizing C Code with Neon Intrinsics topic provides a useful introduction to Neon programming. The tutorial describes how to use Neon intrinsics by examining an example which processes a matrix multiplication.