# Advanced SIMD (Neon) intrinsics

## Introduction

The Advanced SIMD instructions provide packed Single Instruction Multiple Data (SIMD) and single-element scalar operations on a range of integer and floating-point types.

Neon is an implementation of the Advanced SIMD instructions which is provided as an extension for some Cortex-A Series processors. Where this document refers to Neon instructions, such instructions refer to the Advanced SIMD instructions as described by the Arm Architecture Reference Manual [ARMARMv8].

The Advanced SIMD extension provides for arithmetic, logical and saturated arithmetic operations on 8-bit, 16-bit and 32-bit integers (and sometimes on 64-bit integers) and on 32-bit and 64-bit floating-point data, arranged in 64-bit and 128-bit vectors.

The intrinsics in this section provide C and C++ programmers with a simple programming model allowing easy access to code-generation of the Advanced SIMD instructions for both AArch64 and AArch32 execution states.

### Concepts

The Advanced SIMD instructions are designed to improve the performance of
multimedia and signal processing algorithms by operating on 64-bit or 128-bit
*vectors* of *elements* of the same *scalar* data type.

For example, `uint16x4_t`

is a 64-bit vector type consisting of four
elements of the scalar `uint16_t`

data type. Likewise, `uint16x8_t`

is
a 128-bit vector type consisting of eight `uint16_t`

elements.

In a vector programming model, operations are performed in parallel across
the elements of the vector. For example, `vmul_u16(a, b)`

is a vector
intrinsic which takes two `uint16x4_t`

vector arguments `a`

and `b`

,
and returns the result of multiplying corresponding elements from each vector
together.

The Advanced SIMD extension also provides support for *vector-by-lane*
and *vector-by-scalar* operations. In these operations, a scalar value
is extracted from one element of a vector input, or provided directly,
duplicated to create a new vector with the same number of elements as an
input vector, and an operation is performed in parallel between
this new vector and other input vectors.

For example, `vmul_lane_u16(a, b, 1)`

, is a vector-by-lane intrinsic
which takes two `uint16x4_t`

vector elements. From `b`

, element `1`

is extracted, a new vector is formed which consists of four copies of `b`

,
and this new vector is multiplied by `a`

.

*Reduction*, *cross-lane*, and *pairwise* vector operations work on pairs
of elements within a vector, or across the whole of a single vector
performing the same operation between elements of that vector. For example,
`vaddv_u16(a)`

is a reduction intrinsic which takes a `uint16x4_t`

vector, adds each of the four `uint16_t`

elements together, and returns
a `uint16_t`

result containing the sum.

### Vector data types

Vector data types are named as a lane type and a multiple. Lane type
names are based on the types defined in `<stdint.h>`

. For example,.
`int16x4_t`

is a vector of four `int16_t`

values. The base types are
`int8_t`

, `uint8_t`

, `int16_t`

, `uint16_t`

, `int32_t`

,
`uint32_t`

, `int64_t`

, `uint64_t`

, `float16_t`

, `float32_t`

,
`poly8_t`

, `poly16_t`

, `poly64_t`

, `poly128_t`

and ```
bfloat16_t`. The multiples are
such that the resulting vector types are 64-bit and 128-bit. In AArch64,
``float64_t
```

is also a base type.

Not all types can be used in all operations. Generally, the operations available on a type correspond to the operations available on the corresponding scalar type.

ACLE does not define whether `int64x1_t`

is the same type as `int64_t`

,
or whether `uint64x1_t`

is the same type as `uint64_t`

, or whether
`poly64x1_t`

is the same as `poly64_t`

for example for C++ overloading
purposes.

float16 types are only available when the `__fp16`

type is defined, i.e.
when supported by the hardware.

bfloat types are only available when the `__bf16`

type is defined, i.e.
when supported by the hardware. The bfloat types are all opaque types. That is
to say they can only be used by intrinsics.

### Advanced SIMD Scalar data types

AArch64 supports Advanced SIMD scalar operations that work on standard
scalar data types viz. `int8_t`

, `uint8_t`

, `int16_t`

, `uint16_t`

,
`int32_t`

, `uint32_t`

, `int64_t`

, `uint64_t`

, `float32_t`

,
`float64_t.`

### Vector array data types

Array types are defined for multiples of 2, 3 or 4 of all the vector
types, for use in load and store operations, in table-lookup operations,
and as the result type of operations that return a pair of vectors. For
a vector type `<type>_t`

the corresponding array type is
`<type>x<length>_t`

. Concretely, an array type is a structure containing
a single array element called val.

For example an array of two `int16x4_t`

types is `int16x4x2_t`

, and is
represented as:

```
struct int16x4x2_t { int16x4_t val[2]; };
```

Note that this array of two 64-bit vector types is distinct from the
128-bit vector type `int16x8_t`

.

### Scalar data types

For consistency, `<arm_neon.h>`

defines some additional scalar data types
to match the vector types.

`float32_t`

is defined as an alias for `float`

.

If the `__fp16`

type is defined, `float16_t`

is defined as an alias for
it.

If the `__bf16`

type is defined, `bfloat16_t`

is defined as an alias for it.

`poly8_t`

, `poly16_t`

, `poly64_t`

and `poly128_t`

are defined as
unsigned integer types. It is unspecified whether these are the same type as
`uint8_t`

, `uint16_t`

, `uint64_t`

and `uint128_t`

for overloading and
mangling purposes.

`float64_t`

is defined as an alias for `double`

.

### 16-bit floating-point arithmetic scalar intrinsics

The architecture extensions introduced by Armv8.2-A [ARMARMv82] provide a set of data processing instructions which operate on 16-bit floating-point quantities. These instructions are available in both AArch64 and AArch32 execution states, for both Advanced SIMD and scalar floating-point values.

ACLE defines two sets of intrinsics which correspond to these data processing instructions; a set of scalar intrinsics, and a set of vector intrinsics.

The intrinsics introduced in this section use the data types defined
by ACLE. In particular, scalar intrinsics use the `float16_t`

type
defined by ACLE as an alias for the `__fp16`

type, and vector intrinsics
use the `float16x4_t`

and `float16x8_t`

vector types.

Where the scalar 16-bit floating point intrinsics are available,
an implementation is required to ensure that including
`<arm_neon.h>`

has the effect of also including `<arm_fp16.h>`

.

To only enable support for the scalar 16-bit floating-point intrinsics,
the header `<arm_fp16.h>`

may be included directly.

### 16-bit brain floating-point arithmetic scalar intrinsics

The architecture extensions introduced by Armv8.6-A [Bfloat16] provide a set of data processing instructions which operate on brain 16-bit floating-point quantities. These instructions are available in both AArch64 and AArch32 execution states, for both Advanced SIMD and scalar floating-point values.

The brain 16-bit floating-point format (bfloat) differs from the older 16-bit floating-point format (float16) in that the former has an 8-bit exponent similar to a single-precision floating-point format but has a 7-bit fraction.

ACLE defines two sets of intrinsics which correspond to these data processing instructions; a set of scalar intrinsics, and a set of vector intrinsics.

The intrinsics introduced in this section use the data types defined
by ACLE. In particular, scalar intrinsics use the `bfloat16_t`

type
defined by ACLE as an alias for the `__bf16`

type, and vector intrinsics
use the `bfloat16x4_t`

and `bfloat16x8_t`

vector types.

Where the 16-bit brain floating point intrinsics are available,
an implementation is required to ensure that including
`<arm_neon.h>`

has the effect of also including `<arm_bf16.h>`

.

To only enable support for the 16-bit brain floating-point intrinsics,
the header `<arm_bf16.h>`

may be included directly.

When `__ARM_BF16_FORMAT_ALTERNATIVE`

is defined to 1 then these types are
storage only and cannot be used with anything other than ACLE intrinsics. The
underlying type for them is `uint16_t`

.

### Operations on data types

ACLE does not define implicit conversion between different data types. E.g.

```
int32x4_t x;
uint32x4_t y = x; // No representation change
float32x4_t z = x; // Conversion of integer to floating type
```

Is not portable. Use the `vreinterpret`

intrinsics to convert from one
vector type to another without changing representation, and use the `vcvt`

intrinsics to convert between integer and floating types; for example:

```
int32x4_t x;
uint32x4_t y = vreinterpretq_u32_s32(x);
float32x4_t z = vcvt_f32_s32(x);
```

ACLE does not define static construction of vector types. E.g.

```
int32x4_t x = { 1, 2, 3, 4 };
```

Is not portable. Use the `vcreate`

or `vdup`

intrinsics to construct values
from scalars.

In C++, ACLE does not define whether Advanced SIMD data types are POD types or whether they can be inherited from.

### Compatibility with other vector programming models

ACLE does not specify how the Advanced SIMD Intrinsics interoperate with alternative vector programming models. Consequently, programmers should take particular care when combining the Advanced SIMD Intrinsics programming model with such programming models.

For example, the GCC vector extensions permit initialising a variable using array syntax, as so

```
#include "arm_neon.h"
...
uint32x2_t x = {0, 1}; // GCC extension.
uint32_t y = vget_lane_s32 (x, 0); // ACLE Neon Intrinsic.
```

But the definition of the GCC vector extensions is such that the value stored in y will depend on both the target architecture (AArch32 or AArch64) and whether the program is running in big- or little-endian mode.

It is recommended that Advanced SIMD Intrinsics be used consistently:

```
#include "arm_neon.h"
...
const int temp[2] = {0, 1};
uint32x2_t x = vld1_s32 (temp);
uint32_t y = vget_lane_s32 (x, 0);
```

### Availability of Advanced SIMD intrinsics

Advanced SIMD support is available if the `__ARM_NEON`

macro is
predefined (see Advanced SIMD architecture extension (Neon)). In order to access the Advanced SIMD
intrinsics, it is necessary to include the `<arm_neon.h>`

header.

```
#if __ARM_NEON
#include <arm_neon.h>
/* Advanced SIMD intrinsics are now available to use. */
#endif
```

Some intrinsics are only available when compiling for the AArch64
execution state. This can be determined using the `__ARM_64BIT_STATE`

predefined macro (see A32/T32 instruction set architecture.

### Availability of 16-bit floating-point vector interchange types

When the 16-bit floating-point data type `__fp16`

is available as an
interchange type for scalar values, it is also available in the vector
interchange types `float16x4_t`

and `float16x8_t`

. When the vector
interchange types are available, conversion intrinsics between
vector of `__fp16`

and vector of `float`

types are provided.

This is indicated by the setting of bit 1 in `__ARM_NEON_FP`

(see Neon floating-point).

```
#if __ARM_NEON_FP & 0x1
/* 16-bit floating point vector types are available. */
float16x8_t storage;
#endif
```

### Availability of fused multiply-accumulate intrinsics

Whenever fused multiply-accumulate is available for scalar operations, it is also available as a vector operation in the Advanced SIMD extension. When a vector fused multiply-accumulate is available, intrinsics are defined to access it.

This is indicated by `__ARM_FEATURE_FMA`

(see Fused multiply-accumulate (FMA)).

```
#if __ARM_FEATURE_FMA
/* Fused multiply-accumulate intrinsics are available. */
float32x4_t a, b, c;
vfma_f32 (a, b, c);
#endif
```

### Availability of Armv8.1-A Advanced SIMD intrinsics

The Armv8.1-A [ARMARMv81] architecture introduces two new instructions: SQRDMLAH and SQRDMLSH. ACLE specifies vector and vector-by-lane intrinsics to access these instructions where they are available in hardware.

This is indicated by `__ARM_FEATURE_QRDMX`

(see Rounding doubling multiplies).

```
#if __ARM_FEATURE_QRDMX
/* Armv8.1-A RDMA extensions are available. */
int16x4_t a, b, c;
vqrdmlah_s16 (a, b, c);
#endif
```

### Availability of 16-bit floating-point arithmetic intrinsics

Armv8.2-A [ARMARMv82] introduces new data processing instructions which operate on 16-bit floating point data in the IEEE754-2008 [IEEE-FP] format. ACLE specifies intrinsics which map to the vector forms of these instructions where they are available in hardware.

This is indicated by `__ARM_FEATURE_FP16_VECTOR_ARITHMETIC`

(see 16-bit floating-point data processing operations).

```
#if __ARM_FEATURE_FP16_VECTOR_ARITHMETIC
float16x8_t a, b;
vaddq_f16 (a, b);
#endif
```

ACLE also specifies intrinsics which map to the scalar forms of these
instructions, see 16-bit floating-point arithmetic scalar intrinsics. Availability of the scalar
intrinsics is indicated by `__ARM_FEATURE_FP16_SCALAR_ARITHMETIC`

.

```
#if __ARM_FEATURE_FP16_SCALAR_ARITHMETIC
float16_t a, b;
vaddh_f16 (a, b);
#endif
```

### Availability of 16-bit brain floating-point arithmetic intrinsics

Armv8.2-A [ARMARMv82] introduces new data processing instructions which operate on 16-bit brain floating point data as described in the Arm Architecture Reference Manual. ACLE specifies intrinsics which map to the vector forms of these instructions where they are available in hardware.

This is indicated by `__ARM_FEATURE_BF16_VECTOR_ARITHMETIC`

(see Brain half-precision (16-bit) floating-point format).

```
#if __ARM_FEATURE_BF16_VECTOR_ARITHMETIC
float32x2_t res = {0};
bfloat16x4_t a' = vld1_bf16 (a);
bfloat16x4_t b' = vld1_bf16 (b);
res = vdot_bf16 (res, a', b');
#endif
```

ACLE also specifies intrinsics which map to the scalar forms of these
instructions, see 16-bit brain floating-point arithmetic scalar intrinsics. Availability of the scalar
intrinsics is indicated by `__ARM_FEATURE_BF16_SCALAR_ARITHMETIC`

.

```
#if __ARM_FEATURE_BF16_SCALAR_ARITHMETIC
bfloat16_t a;
float32_t b = ..;
a = b<convert> (b);
#endif
```

### Availability of Armv8.4-A Advanced SIMD intrinsics

New Crypto and FP16 Floating Point Multiplication Variant instructions in Armv8.4-A:

- New SHA512 crypto instructions (available if
`__ARM_FEATURE_SHA512`

) - New SHA3 crypto instructions (available if
`__ARM_FEATURE_SHA3`

) - SM3 crypto instructions (available if
`__ARM_FEATURE_SM3`

) - SM4 crypto instructions (available if
`__ARM_FEATURE_SM4`

) - New FML[A|S] instructions (available if
`__ARM_FEATURE_FP16_FML`

).

These instructions have been backported as optional instructions to Armv8.2-A and Armv8.3-A.

### Availability of Dot Product intrinsics

The architecture extensions introduced by Armv8.2-A provide a set of dot product
instructions which operate on 8-bit sub-element quantities. These instructions
are available in both AArch64 and AArch32 execution states using
Advanced SIMD instructions. These intrinsics are available
when `__ARM_FEATURE_DOTPROD`

is defined (see Dot Product extension).

```
#if __ARM_FEATURE_DOTPROD
uint8x8_t a, b;
vdot_u8 (a, b);
#endif
```

### Availability of Armv8.5-A floating-point rounding intrinsics

The architecture extensions introduced by Armv8.5-A provide a set of
floating-point rounding instructions that round a floating-point number to an
to a floating-point value that would be representable in a 32-bit or 64-bit
signed integer type.
NaNs, Infinities and Out-of-Range values are forced to the
Most Negative Integer representable in the target size, and an
Invalid Operation Floating-Point Exception is generated.
These instructions are available only in the AArch64 execution state.
The intrinsics for these are available when `__ARM_FEATURE_FRINT`

is defined.
The Advanced SIMD intrinsics are specified in the Arm Neon Intrinsics
Reference Architecture Specification [Neon].

### Availability of Armv8.6-A Integer Matrix Multiply intrinsics

The architecture extensions introduced by Armv8.6-A provide a set of integer matrix multiplication and mixed sign dot product instructions. These instructions are optional from Armv8.2-A to Armv8.5-A.

These intrinsics are available when `__ARM_FEATURE_MATMUL_INT8`

is defined
(see Matrix Multiply Intrinsics).

## Specification of Advanced SIMD intrinsics

The Advanced SIMD intrinsics are specified in the Arm Neon Intrinsics Reference Architecture Specification [Neon].

The behavior of an intrinsic is specified to be equivalent to the AArch64 instruction it is mapped to in [Neon]. Intrinsics are specified as a mapping between their name, arguments and return values and the AArch64 instruction and assembler operands which they are equivalent to.

A compiler may make use of the as-if rule from C [C99] (5.1.2.3) to perform optimizations which preserve the instruction semantics.

## Undefined behavior

Care should be taken by compiler implementers not to introduce the concept of
undefined behavior to the semantics of an intrinsic. For example, the
`vabsd_s64`

intrinsic has well defined behaviour for all input values,
while the C99 `llabs`

has undefined behaviour if the result would not
be representable in a `long long`

type. It would thus be incorrect to
implement `vabsd_s64`

as a wrapper function or macro around `llabs`

.