Fundamentals of Armv8 Neon technology
Armv8-A includes both 32-bit and 64-bit Execution states, each with its own instruction sets:
AArch64 is the name used to describe the 64-bit Execution state of the Armv8-A architecture.
In AArch64 state, the processor executes the A64 instruction set, which contains Neon instructions (also referred to as SIMD instructions). GNU and Linux documentation sometimes refers to AArch64 as ARM64.
AArch32 describes the 32-bit Execution state of the Armv8-A architecture, which is almost identical to Armv7.
In AArch32 state, the processor can execute either the A32 (called ARM in earlier versions of the architecture) or the T32 (Thumb) instruction set. The A32 and T32 instruction sets are backwards compatible with Armv7, including Neon instructions.
This guide will focus on Neon programming using A64 instructions for the AArch64 Execution state of the Armv8-A architecture.
If you want to write Neon code to run in the AArch32 Execution state of the Armv8-A architecture, you should refer to version 1.0 of the Neon Programmer's Guide.
Registers, vectors, lanes and elements
If you are familiar with the Armv8-A architecture profile, you will have noticed that in AArch64 state, Armv8 cores use 64-bit general-purpose registers, but the Neon unit uses 128-bit registers for SIMD processing.
This is possible because the Neon unit operates on a separate register file of 128-bit registers. The Neon unit is fully integrated into the processor and shares the processor resources for integer operation, loop control, and caching. This significantly reduces the area and power cost compared to a hardware accelerator. It also uses a much simpler programming model, since the Neon unit uses the same address space as the application.
The Neon register file is a collection of registers which can be accessed as 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit registers.
The Neon registers contain vectors of elements of the same data type. The same element position in the input and output registers is referred to as a lane.
Usually each Neon instruction results in n operations occurring in parallel, where n is the number of lanes that the input vectors are divided into. Each operation is contained within the lane. There cannot be a carry or overflow from one lane to another.
The number of lanes in a Neon vector depends on the size of the vector and the data elements in the vector.
A 128-bit Neon vector can contain the following element sizes:
Sixteen 8-bit elements (operand suffix .16B, where B indicates byte)
Eight 16-bit elements (operand suffix .8H, where H indicates halfword)
Four 32-bit elements (operand suffix .4S, where S indicates word)
Two 64-bit elements (operand suffix .2D, where D indicates doubleword)
A 64-bit Neon vector can contain the following element sizes (with the upper 64 bits of the 128-bit register cleared to zero):
Eight 8-bit elements (operand suffix .8B)
Four 16-bit elements (operand suffix .4H)
Two 32-bit elements (operand suffix .2S)
Elements in a vector are ordered from the least significant bit. That is, element 0 uses the least significant bits of the register. Let’s look at an example of a Neon instruction. The instruction ADD V0.8H, V1.8H, V2.8H performs a parallel addition of eight lanes of 16-bit (8 x 16 = 128) integer elements from vectors in V1 and V2, storing the result in V0:
Some Neon instructions act on scalars together with vectors. Instructions which use scalars specify a lane index to refer to a specific element in a register. For example, the instruction MUL V0.4S, V2.4S, V3.S[2] multiplies each of the four 32-bit elements in V2 by the 32-bit scalar value in lane 2 of V3, storing the result vector in V0.