Advanced SIMD (Neon) intrinsics

Introduction

The Advanced SIMD instructions provide packed Single Instruction Multiple Data (SIMD) and single-element scalar operations on a range of integer and floating-point types.

Neon is an implementation of the Advanced SIMD instructions which is provided as an extension for some Cortex-A Series processors. Where this document refers to Neon instructions, such instructions refer to the Advanced SIMD instructions as described by the Arm Architecture Reference Manual [ARMARMv8].

The Advanced SIMD extension provides for arithmetic, logical and saturated arithmetic operations on 8-bit, 16-bit and 32-bit integers (and sometimes on 64-bit integers) and on 32-bit and 64-bit floating-point data, arranged in 64-bit and 128-bit vectors.

The intrinsics in this section provide C and C++ programmers with a simple programming model allowing easy access to code-generation of the Advanced SIMD instructions for both AArch64 and AArch32 execution states.

Concepts

The Advanced SIMD instructions are designed to improve the performance of multimedia and signal processing algorithms by operating on 64-bit or 128-bit vectors of elements of the same scalar data type.

For example, uint16x4_t is a 64-bit vector type consisting of four elements of the scalar uint16_t data type. Likewise, uint16x8_t is a 128-bit vector type consisting of eight uint16_t elements.

In a vector programming model, operations are performed in parallel across the elements of the vector. For example, vmul_u16(a, b) is a vector intrinsic which takes two uint16x4_t vector arguments a and b, and returns the result of multiplying corresponding elements from each vector together.

The Advanced SIMD extension also provides support for vector-by-lane and vector-by-scalar operations. In these operations, a scalar value is extracted from one element of a vector input, or provided directly, duplicated to create a new vector with the same number of elements as an input vector, and an operation is performed in parallel between this new vector and other input vectors.

For example, vmul_lane_u16(a, b, 1) is a vector-by-lane intrinsic which takes two uint16x4_t vector arguments. Element 1 is extracted from b, a new vector is formed consisting of four copies of that element, and this new vector is multiplied by a.

Reduction, cross-lane, and pairwise vector operations work on pairs of elements within a vector, or across the whole of a single vector, performing the same operation between elements of that vector. For example, vaddv_u16(a) is a reduction intrinsic which takes a uint16x4_t vector, adds each of the four uint16_t elements together, and returns a uint16_t result containing the sum.

Vector data types

Vector data types are named as a lane type and a multiple. Lane type names are based on the types defined in <stdint.h>. For example, int16x4_t is a vector of four int16_t values. The base types are int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, float16_t, float32_t, poly8_t, poly16_t, poly64_t, poly128_t and bfloat16_t. The multiples are such that the resulting vector types are 64-bit and 128-bit. In AArch64, float64_t is also a base type.

Not all types can be used in all operations. Generally, the operations available on a type correspond to the operations available on the corresponding scalar type.

ACLE does not define whether int64x1_t is the same type as int64_t, or whether uint64x1_t is the same type as uint64_t, or whether poly64x1_t is the same as poly64_t for example for C++ overloading purposes.

float16 types are only available when the __fp16 type is defined, i.e. when supported by the hardware.

bfloat types are only available when the __bf16 type is defined, i.e. when supported by the hardware. The bfloat types are all opaque; that is, they can only be used by intrinsics.

Advanced SIMD Scalar data types

AArch64 supports Advanced SIMD scalar operations that work on the standard scalar data types, namely int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, float32_t and float64_t.

Vector array data types

Array types are defined for multiples of 2, 3 or 4 of all the vector types, for use in load and store operations, in table-lookup operations, and as the result type of operations that return a pair of vectors. For a vector type <type>_t the corresponding array type is <type>x<length>_t. Concretely, an array type is a structure containing a single array element called val.

For example an array of two int16x4_t types is int16x4x2_t, and is represented as:

struct int16x4x2_t { int16x4_t val[2]; };

Note that this array of two 64-bit vector types is distinct from the 128-bit vector type int16x8_t.

Scalar data types

For consistency, <arm_neon.h> defines some additional scalar data types to match the vector types.

float32_t is defined as an alias for float.

If the __fp16 type is defined, float16_t is defined as an alias for it.

If the __bf16 type is defined, bfloat16_t is defined as an alias for it.

poly8_t, poly16_t, poly64_t and poly128_t are defined as unsigned integer types. It is unspecified whether these are the same type as uint8_t, uint16_t, uint64_t and uint128_t for overloading and mangling purposes.

float64_t is defined as an alias for double.

16-bit floating-point arithmetic scalar intrinsics

The architecture extensions introduced by Armv8.2-A [ARMARMv82] provide a set of data processing instructions which operate on 16-bit floating-point quantities. These instructions are available in both AArch64 and AArch32 execution states, for both Advanced SIMD and scalar floating-point values.

ACLE defines two sets of intrinsics which correspond to these data processing instructions; a set of scalar intrinsics, and a set of vector intrinsics.

The intrinsics introduced in this section use the data types defined by ACLE. In particular, scalar intrinsics use the float16_t type defined by ACLE as an alias for the __fp16 type, and vector intrinsics use the float16x4_t and float16x8_t vector types.

Where the scalar 16-bit floating point intrinsics are available, an implementation is required to ensure that including <arm_neon.h> has the effect of also including <arm_fp16.h>.

To only enable support for the scalar 16-bit floating-point intrinsics, the header <arm_fp16.h> may be included directly.

16-bit brain floating-point arithmetic scalar intrinsics

The architecture extensions introduced by Armv8.6-A [Bfloat16] provide a set of data processing instructions which operate on brain 16-bit floating-point quantities. These instructions are available in both AArch64 and AArch32 execution states, for both Advanced SIMD and scalar floating-point values.

The brain 16-bit floating-point format (bfloat) differs from the older 16-bit floating-point format (float16) in that the former has an 8-bit exponent similar to a single-precision floating-point format but has a 7-bit fraction.

ACLE defines two sets of intrinsics which correspond to these data processing instructions; a set of scalar intrinsics, and a set of vector intrinsics.

The intrinsics introduced in this section use the data types defined by ACLE. In particular, scalar intrinsics use the bfloat16_t type defined by ACLE as an alias for the __bf16 type, and vector intrinsics use the bfloat16x4_t and bfloat16x8_t vector types.

Where the 16-bit brain floating point intrinsics are available, an implementation is required to ensure that including <arm_neon.h> has the effect of also including <arm_bf16.h>.

To only enable support for the 16-bit brain floating-point intrinsics, the header <arm_bf16.h> may be included directly.

When __ARM_BF16_FORMAT_ALTERNATIVE is defined to 1 then these types are storage only and cannot be used with anything other than ACLE intrinsics. The underlying type for them is uint16_t.

Operations on data types

ACLE does not define implicit conversion between different data types. For example, the following is not portable:

int32x4_t x;
uint32x4_t y = x; // No representation change
float32x4_t z = x; // Conversion of integer to floating type

Use the vreinterpret intrinsics to convert from one vector type to another without changing representation, and use the vcvt intrinsics to convert between integer and floating-point types; for example:

int32x4_t x;
uint32x4_t y = vreinterpretq_u32_s32(x);
float32x4_t z = vcvtq_f32_s32(x);

ACLE does not define static construction of vector types. For example, the following is not portable:

int32x4_t x = { 1, 2, 3, 4 };

Use the vcreate or vdup intrinsics to construct values from scalars.

In C++, ACLE does not define whether Advanced SIMD data types are POD types or whether they can be inherited from.

Compatibility with other vector programming models

ACLE does not specify how the Advanced SIMD Intrinsics interoperate with alternative vector programming models. Consequently, programmers should take particular care when combining the Advanced SIMD Intrinsics programming model with such programming models.

For example, the GCC vector extensions permit initializing a variable using array syntax, like so:

#include "arm_neon.h"
...
uint32x2_t x = {0, 1}; // GCC extension.
uint32_t y = vget_lane_u32 (x, 0); // ACLE Neon Intrinsic.

But the definition of the GCC vector extensions is such that the value stored in y will depend on both the target architecture (AArch32 or AArch64) and whether the program is running in big- or little-endian mode.

It is recommended that Advanced SIMD Intrinsics be used consistently:

#include "arm_neon.h"
...
const uint32_t temp[2] = {0, 1};
uint32x2_t x = vld1_u32 (temp);
uint32_t y = vget_lane_u32 (x, 0);

Availability of Advanced SIMD intrinsics and Extensions

Availability of Advanced SIMD intrinsics

Advanced SIMD support is available if the __ARM_NEON macro is predefined (see Advanced SIMD architecture extension (Neon)). In order to access the Advanced SIMD intrinsics, it is necessary to include the <arm_neon.h> header.

#if __ARM_NEON
#include <arm_neon.h>
  /* Advanced SIMD intrinsics are now available to use.  */
#endif

Some intrinsics are only available when compiling for the AArch64 execution state. This can be determined using the __ARM_64BIT_STATE predefined macro (see A32/T32 instruction set architecture).

Availability of 16-bit floating-point vector interchange types

When the 16-bit floating-point data type __fp16 is available as an interchange type for scalar values, it is also available in the vector interchange types float16x4_t and float16x8_t. When the vector interchange types are available, conversion intrinsics between vector of __fp16 and vector of float types are provided.

This is indicated by the setting of bit 1 in __ARM_NEON_FP (see Neon floating-point).

#if __ARM_NEON_FP & 0x2
  /* 16-bit floating-point vector types are available.  */
  float16x8_t storage;
#endif

Availability of fused multiply-accumulate intrinsics

Whenever fused multiply-accumulate is available for scalar operations, it is also available as a vector operation in the Advanced SIMD extension. When a vector fused multiply-accumulate is available, intrinsics are defined to access it.

This is indicated by __ARM_FEATURE_FMA (see Fused multiply-accumulate (FMA)).

#if __ARM_FEATURE_FMA
  /* Fused multiply-accumulate intrinsics are available.  */
  float32x4_t a, b, c;
  vfmaq_f32 (a, b, c);
#endif

Availability of Armv8.1-A Advanced SIMD intrinsics

The Armv8.1-A [ARMARMv81] architecture introduces two new instructions: SQRDMLAH and SQRDMLSH. ACLE specifies vector and vector-by-lane intrinsics to access these instructions where they are available in hardware.

This is indicated by __ARM_FEATURE_QRDMX (see Rounding doubling multiplies).

#if __ARM_FEATURE_QRDMX
  /* Armv8.1-A RDMA extensions are available.  */
  int16x4_t a, b, c;
  vqrdmlah_s16 (a, b, c);
#endif

Availability of 16-bit floating-point arithmetic intrinsics

Armv8.2-A [ARMARMv82] introduces new data processing instructions which operate on 16-bit floating point data in the IEEE754-2008 [IEEE-FP] format. ACLE specifies intrinsics which map to the vector forms of these instructions where they are available in hardware.

This is indicated by __ARM_FEATURE_FP16_VECTOR_ARITHMETIC (see 16-bit floating-point data processing operations).

#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
  float16x8_t a, b;
  vaddq_f16 (a, b);
#endif

ACLE also specifies intrinsics which map to the scalar forms of these instructions, see 16-bit floating-point arithmetic scalar intrinsics. Availability of the scalar intrinsics is indicated by __ARM_FEATURE_FP16_SCALAR_ARITHMETIC.

#if __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
  float16_t a, b;
  vaddh_f16 (a, b);
#endif

Availability of 16-bit brain floating-point arithmetic intrinsics

Armv8.2-A [ARMARMv82] introduces new data processing instructions which operate on 16-bit brain floating point data as described in the Arm Architecture Reference Manual. ACLE specifies intrinsics which map to the vector forms of these instructions where they are available in hardware.

This is indicated by __ARM_FEATURE_BF16_VECTOR_ARITHMETIC (see Brain half-precision (16-bit) floating-point format).

#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC
  float32x2_t res = {0};
  bfloat16x4_t va = vld1_bf16 (a);
  bfloat16x4_t vb = vld1_bf16 (b);
  res = vbfdot_f32 (res, va, vb);
#endif

ACLE also specifies intrinsics which map to the scalar forms of these instructions, see 16-bit brain floating-point arithmetic scalar intrinsics. Availability of the scalar intrinsics is indicated by __ARM_FEATURE_BF16_SCALAR_ARITHMETIC.

#if __ARM_FEATURE_BF16_SCALAR_ARITHMETIC
  float32_t b = ..;
  bfloat16_t a = vcvth_bf16_f32 (b);
#endif

Availability of Armv8.4-A Advanced SIMD intrinsics

Armv8.4-A introduces new cryptographic and FP16 floating-point multiplication variant instructions:

  • New SHA512 crypto instructions (available if __ARM_FEATURE_SHA512)
  • New SHA3 crypto instructions (available if __ARM_FEATURE_SHA3)
  • SM3 crypto instructions (available if __ARM_FEATURE_SM3)
  • SM4 crypto instructions (available if __ARM_FEATURE_SM4)
  • New FML[A|S] instructions (available if __ARM_FEATURE_FP16_FML).

These instructions have been backported as optional instructions to Armv8.2-A and Armv8.3-A.

Availability of Dot Product intrinsics

The architecture extensions introduced by Armv8.2-A provide a set of dot product instructions which operate on 8-bit sub-element quantities. These instructions are available in both AArch64 and AArch32 execution states using Advanced SIMD instructions. These intrinsics are available when __ARM_FEATURE_DOTPROD is defined (see Dot Product extension).

#if __ARM_FEATURE_DOTPROD
  uint32x2_t r = {0};
  uint8x8_t a, b;
  r = vdot_u32 (r, a, b);
#endif

Availability of Armv8.5-A floating-point rounding intrinsics

The architecture extensions introduced by Armv8.5-A provide a set of floating-point rounding instructions that round a floating-point number to a floating-point value that would be representable in a 32-bit or 64-bit signed integer type. NaNs, infinities and out-of-range values are forced to the most negative integer representable in the target size, and an Invalid Operation floating-point exception is generated. These instructions are available only in the AArch64 execution state. The intrinsics for these are available when __ARM_FEATURE_FRINT is defined. The Advanced SIMD intrinsics are specified in the Arm Neon Intrinsics Reference Architecture Specification [Neon].

Availability of Armv8.6-A Integer Matrix Multiply intrinsics

The architecture extensions introduced by Armv8.6-A provide a set of integer matrix multiplication and mixed sign dot product instructions. These instructions are optional from Armv8.2-A to Armv8.5-A.

These intrinsics are available when __ARM_FEATURE_MATMUL_INT8 is defined (see Matrix Multiply Intrinsics).

Specification of Advanced SIMD intrinsics

The Advanced SIMD intrinsics are specified in the Arm Neon Intrinsics Reference Architecture Specification [Neon].

The behavior of an intrinsic is specified to be equivalent to the AArch64 instruction it is mapped to in [Neon]. Intrinsics are specified as a mapping between their name, arguments and return values and the AArch64 instruction and assembler operands which they are equivalent to.

A compiler may make use of the as-if rule from C [C99] (5.1.2.3) to perform optimizations which preserve the instruction semantics.

Undefined behavior

Care should be taken by compiler implementers not to introduce the concept of undefined behavior to the semantics of an intrinsic. For example, the vabsd_s64 intrinsic has well-defined behavior for all input values, while the C99 llabs has undefined behavior if the result would not be representable in a long long type. It would thus be incorrect to implement vabsd_s64 as a wrapper function or macro around llabs.

Alignment assertions

The AArch32 Neon load and store instructions provide for alignment assertions, which may speed up access to aligned data (and will fault access to unaligned data). The Advanced SIMD intrinsics do not directly provide a means for asserting alignment.
