# M-profile Vector Extension (MVE) intrinsics

The M-profile Vector Extension (MVE) [MVE-spec] instructions provide packed Single Instruction Multiple Data (SIMD) and single-element scalar operations on a range of integer and floating-point types. MVE can also be referred to as Helium.

The M-profile Vector Extension provides for arithmetic, logical and saturated arithmetic operations on 8-bit, 16-bit and 32-bit integers (and sometimes on 64-bit integers) and on 16-bit and 32-bit floating-point data, arranged in 128-bit vectors.

The intrinsics in this section provide C and C++ programmers with a simple programming model allowing easy access to the code generation of the MVE instructions for the Armv8.1-M Mainline architecture.

## Concepts

The MVE instructions are designed to improve the performance of SIMD operations by operating on 128-bit vectors of elements of the same scalar data type.

For example, `uint16x8_t` is a 128-bit vector type consisting of eight elements of the scalar `uint16_t` data type. Likewise, `uint8x16_t` is a 128-bit vector type consisting of sixteen `uint8_t` elements.

In a vector programming model, operations are performed in parallel across the elements of the vector. For example, `vmulq_u16(a, b)` is a vector intrinsic which takes two `uint16x8_t` vector arguments `a` and `b`, and returns the result of multiplying corresponding elements from each vector together.

The M-profile Vector Extension also provides support for vector-by-scalar operations. In these operations, a scalar value is provided directly, duplicated to create a new vector with the same number of elements as an input vector, and an operation is performed in parallel between this new vector and other input vectors.

For example, `vaddq_n_u16(a, s)` is a vector-by-scalar intrinsic which takes one `uint16x8_t` vector argument and one `uint16_t` scalar argument. A new vector is formed consisting of eight copies of `s`, and this new vector is added to `a`.

Reductions work across the whole of a single vector, performing the same operation between the elements of that vector. For example, `vaddvq_u16(a)` is a reduction intrinsic which takes a `uint16x8_t` vector, adds each of the eight `uint16_t` elements together, and returns a `uint32_t` result containing the sum. Note the difference in return type between MVE's `vaddvq_u16` and the Advanced SIMD intrinsic of the same name: MVE returns `uint32_t`, whereas Advanced SIMD returns the element type `uint16_t`.

Cross-lane and pairwise vector operations work on pairs of elements within a vector. Sometimes they perform the same operation on each pair, as in the vector saturating doubling multiply subtract dual returning high half with exchange intrinsic `vqdmlsdhxq_s8`; sometimes they perform a different one, as with the vector complex addition intrinsic `vcaddq_rot90_s8`.

Some intrinsics read only part of their input vectors, and others write only part of their results. For example, the vector multiply long intrinsics read either the even (bottom) or the odd (top) elements of each `int16x8_t` input vector, depending on whether you use `vmullbq_int_s16` or `vmulltq_int_s16`, multiply them, and write the products to a double-width `int32x4_t` vector. In contrast, the vector shift right and narrow intrinsics read a double-width input vector and, depending on whether you pick the bottom or top variant, write to the even or odd elements of the single-width result vector. For example, `vshrnbq_n_s16(a, b, 2)` takes each of the eight `int16_t` elements of argument `b`, shifts it right by two, narrows it to eight bits, and writes it to the even elements of the `int8x16_t` result vector; the odd elements are taken from the `int8x16_t` argument `a`.

Predication: the M-profile Vector Extension uses vector predication to allow SIMD operations on selected lanes. The MVE intrinsics expose vector predication by providing predicated intrinsic variants for instructions that support it. These intrinsics can be recognized by one of four suffixes:

* `_m` (merging), which indicates that false-predicated lanes are not written to and keep the same value as they had in the first argument of the intrinsic.
* `_p` (predicated), which indicates that false-predicated lanes are not used in the SIMD operation. For example, in `vaddvq_p_s8` the false-predicated lanes are not added to the resulting sum.
* `_z` (zero), which indicates that false-predicated lanes are filled with zeroes. These are only used for load instructions.
* `_x` (don't-care), which indicates that the false-predicated lanes have undefined values. These are syntactic sugar for merge intrinsics with a `vuninitializedq` inactive parameter.

These predicated intrinsics can also be recognized by their last parameter being of type `mve_pred16_t`. This is an alias for the `uint16_t` type. Some predicated intrinsics may have a dedicated first parameter to specify the value in the result vector for the false-predicated lanes; this argument will be of the same type as the result type. For example, `v = veorq_m_s8(inactive, a, b, p)`, will write to each of the sixteen lanes of the result vector `v`, either the result of the exclusive or between the corresponding lanes of vectors `a` and `b`, or the corresponding lane of vector `inactive`, depending on whether that lane is true- or false-predicated in `p`. The types of `inactive`, `a`, `b` and `v` are all `int8x16_t` in this case and `p` has type `mve_pred16_t`.

For the MVE ACLE intrinsics, passing a mask that does not have all bits set to the same value per input-width sized lane to a predicated intrinsic is considered undefined behavior. That is, the user should make sure the producer of the mask uses an element size equal to or higher than the element size of the input vector parameters, other than the ‘inactive’ parameter, of the predicated intrinsic consuming the mask. For example:

```
mve_pred16_t mask8 = vcmpeqq_u8 (a, b);      // Mask with 8-bit lane granularity.
uint8x16_t r8 = vaddq_x_u8 (a, b, mask8);    // This is OK.
uint16x8_t r16 = vaddq_x_u16 (c, d, mask8);  // This is UNDEFINED BEHAVIOR.
mask8 = 0x5555;                              // Predicate every other byte.
r8 = vaddq_x_u8 (a, b, mask8);               // This is OK.
r16 = vaddq_x_u16 (c, d, mask8);             // This is UNDEFINED BEHAVIOR.
```

Users wishing to exploit this predication behavior are encouraged to use inline assembly.

## Scalar shift intrinsics

The M-profile Vector Extension (MVE) also provides a set of scalar shift instructions that operate on signed and unsigned double-words and single-words. These shifts can perform additional saturation, rounding, or both. The ACLE for MVE defines intrinsics for these instructions.

## Namespace

By default all M-profile Vector Extension intrinsics are available with and without the `__arm_` prefix. If the `__ARM_MVE_PRESERVE_USER_NAMESPACE` macro is defined, the `__arm_` prefix is mandatory. This is available to hide the user-namespace-polluting variants of the intrinsics.

## Intrinsic polymorphism

The ACLE for the M-profile Vector Extension intrinsics is designed to support a polymorphic implementation of most intrinsics. The polymorphic name of an intrinsic is indicated by leaving out the type suffix enclosed in square brackets; for example, the vector addition intrinsic `vaddq[_s32]` can be called using the function name `vaddq`. Note that polymorphism is only possible on input parameter types, and intrinsics with the same name must still have the same number of parameters. This is expected to aid implementation of the polymorphism using C11's `_Generic` selection.

## Vector data types

Vector data types are named as a lane type and a multiple. Lane type names are based on the types defined in `<stdint.h>`. For example, `int16x8_t` is a vector of eight `int16_t` values. The base types are `int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`, `int64_t`, `uint64_t`, `float16_t` and `float32_t`. The multiples are such that the resulting vector types are 128-bit.

## Vector array data types

Array types are defined for multiples of 2 and 4 of all the vector types, for use in load and store operations. For a vector type `<type>_t` the corresponding array type is `<type>x<length>_t`. Concretely, an array type is a structure containing a single array member called `val`.

For example, an array of two `int16x8_t` vectors is `int16x8x2_t`, and is represented as:

```
struct int16x8x2_t { int16x8_t val[2]; };
```

## Scalar data types

For consistency, `<arm_mve.h>` defines some additional scalar data types to match the vector types.

`float32_t` is defined as an alias for `float`, `float16_t` is defined as an alias for `__fp16` and `mve_pred16_t` is defined as an alias for `uint16_t`.

## Operations on data types

ACLE does not define implicit conversion between different data types. For example:

```
int32x4_t x;
uint32x4_t y = x; // No representation change
float32x4_t z = x; // Conversion of integer to floating type
```

is not portable. Use the `vreinterpretq` intrinsics to convert from one vector type to another without changing representation, and use the `vcvtq` intrinsics to convert between integer and floating types; for example:

```
int32x4_t x;
uint32x4_t y = vreinterpretq_u32_s32(x);
float32x4_t z = vcvtq_f32_s32(x);
```

ACLE does not define static construction of vector types. For example:

```
int32x4_t x = { 1, 2, 3, 4 };
```

is not portable. Use the `vcreateq` or `vdupq` intrinsics to construct values from scalars.

In C++, ACLE does not define whether MVE data types are POD types or whether they can be inherited from.

## Compatibility with other vector programming models

ACLE does not specify how the MVE Intrinsics interoperate with alternative vector programming models. Consequently, programmers should take particular care when combining the MVE programming model with such programming models.

For example, the GCC vector extensions permit initializing a variable using array syntax:

```
#include "arm_mve.h"
...
uint32x4_t x = {0, 1, 2, 3}; // GCC extension.
uint32_t y = vgetq_lane_u32 (x, 0); // ACLE MVE intrinsic.
```

But the definition of the GCC vector extensions is such that the value stored in `y` will depend on whether the program is running in big- or little-endian mode.

It is recommended that MVE Intrinsics be used consistently:

```
#include "arm_mve.h"
...
const uint32_t temp[] = {0, 1, 2, 3};
uint32x4_t x = vld1q_u32 (temp);
uint32_t y = vgetq_lane_u32 (x, 0);
```

## Availability of M-profile Vector Extension intrinsics

M-profile Vector Extension support is available if the `__ARM_FEATURE_MVE` macro has a value other than 0 (see M-profile Vector Extension). The MVE floating-point data types and intrinsics are available only if this macro additionally has bit 1 set, that is, if `__ARM_FEATURE_MVE & 2` is nonzero. In order to access the MVE intrinsics, it is necessary to include the `<arm_mve.h>` header.

```
#if (__ARM_FEATURE_MVE & 3) == 3
#include <arm_mve.h>
/* MVE integer and floating point intrinsics are now available to use.  */
#elif __ARM_FEATURE_MVE & 1
#include <arm_mve.h>
/* MVE integer intrinsics are now available to use.  */
#endif
```

### Specification of M-profile Vector Extension intrinsics

The M-profile Vector Extension intrinsics are specified in the Arm MVE Intrinsics Reference Architecture Specification [MVE].

The behavior of an intrinsic is specified to be equivalent to the MVE instruction it is mapped to in [MVE]. Intrinsics are specified as a mapping between their name, arguments and return values and the MVE instruction and assembler operands which they are equivalent to.

A compiler may make use of the as-if rule from C [C99] (5.1.2.3) to perform optimizations which preserve the instruction semantics.

### Undefined behavior

Care should be taken by compiler implementers not to introduce the concept of undefined behavior to the semantics of an intrinsic.

### Alignment assertions

The MVE load and store instructions provide for alignment assertions, which may speed up access to aligned data (and will fault access to unaligned data). The MVE intrinsics do not directly provide a means for asserting alignment.