# M-profile Vector Extension (MVE) intrinsics

The M-profile Vector Extension (MVE) [MVE-spec] instructions provide packed Single Instruction Multiple Data (SIMD) and single-element scalar operations on a range of integer and floating-point types. MVE is also referred to as Arm Helium technology.

The M-profile Vector Extension provides for arithmetic, logical and saturated arithmetic operations on 8-bit, 16-bit and 32-bit integers (and sometimes on 64-bit integers) and on 16-bit and 32-bit floating-point data, arranged in 128-bit vectors.

The intrinsics in this section provide C and C++ programmers with a simple programming model allowing easy access to the code generation of the MVE instructions for the Armv8.1-M Mainline architecture.

## Concepts

The MVE instructions are designed to improve the performance of SIMD operations
by operating on 128-bit *vectors* of *elements* of the same *scalar* data type.

For example, `uint16x8_t` is a 128-bit vector type consisting of eight elements
of the scalar `uint16_t` data type. Likewise, `uint8x16_t` is a 128-bit vector
type consisting of sixteen `uint8_t` elements.

In a vector programming model, operations are performed in parallel across
the elements of the vector. For example, `vmulq_u16(a, b)` is a vector
intrinsic which takes two `uint16x8_t` vector arguments `a` and `b`,
and returns the result of multiplying corresponding elements from each vector
together.
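To make the lane-wise behaviour concrete, here is a minimal scalar model of `vmulq_u16` in portable C. The function name and the use of plain arrays in place of `uint16x8_t` are illustrative assumptions, not part of `<arm_mve.h>`:

```c
#include <stdint.h>

/* Scalar model of vmulq_u16: multiply corresponding 16-bit lanes.
   Each lane wraps modulo 2^16, just like plain uint16_t arithmetic. */
static void model_vmulq_u16(const uint16_t a[8], const uint16_t b[8],
                            uint16_t r[8])
{
    for (int i = 0; i < 8; i++)
        r[i] = (uint16_t)((uint32_t)a[i] * b[i]);
}
```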

The M-profile Vector Extension also provides support for *vector-by-scalar*
operations. In these operations, a scalar value is provided directly,
duplicated to create a new vector with the same number of elements as an
input vector, and an operation is performed in parallel between
this new vector and other input vectors.

For example, `vaddq_n_u16(a, s)` is a vector-by-scalar intrinsic
which takes one `uint16x8_t` vector argument and one `uint16_t` scalar
argument. A new vector is formed which consists of eight copies of `s`,
and this new vector is added to `a`.
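The duplicate-then-operate behaviour can be sketched the same way in portable C; again the name and array representation are illustrative, not the real intrinsic:

```c
#include <stdint.h>

/* Scalar model of vaddq_n_u16: broadcast the scalar s to all eight
   lanes, then add lane-wise (wrapping modulo 2^16). */
static void model_vaddq_n_u16(const uint16_t a[8], uint16_t s,
                              uint16_t r[8])
{
    for (int i = 0; i < 8; i++)
        r[i] = (uint16_t)(a[i] + s);
}
```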

*Reductions* work across the whole of a single vector, performing the same
operation between elements of that vector. For example, `vaddvq_u16(a)` is a
reduction intrinsic which takes a `uint16x8_t` vector, adds each of the eight
`uint16_t` elements together, and returns a `uint32_t` result containing
the sum. Note the difference in return types between MVE's `vaddvq_u16` and
the Advanced SIMD intrinsic of the same name: MVE returns the
`uint32_t` type whereas Advanced SIMD returns the element type `uint16_t`.
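The widened return type matters as soon as the lane sums exceed the element type's range; this hedged scalar model (the name is illustrative) makes that visible:

```c
#include <stdint.h>

/* Scalar model of vaddvq_u16: sum all eight lanes into a uint32_t
   accumulator, so the total is not truncated to the 16-bit element
   type. */
static uint32_t model_vaddvq_u16(const uint16_t a[8])
{
    uint32_t sum = 0;
    for (int i = 0; i < 8; i++)
        sum += a[i];
    return sum;
}
```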

*Cross-lane* and *pairwise* vector operations work on pairs of elements within
a vector, sometimes performing the same operation on each pair, as in the case
of the vector saturating doubling multiply subtract dual returning high half
with exchange `vqdmlsdhxq_s8`, and sometimes a different one, as is the case
with the vector complex addition intrinsic `vcaddq_rot90_s8`.

Some intrinsics read only part of the input vectors, whereas others write only
part of the results. For example, the vector multiply long intrinsics will,
depending on whether you use `vmullbq_int_s32` or `vmulltq_int_s32`, read
the even (bottom) or odd (top) elements of each `int16x8_t` input vector,
multiply them, and write to a double-width `int32x4_t` vector.
In contrast, the vector shift right and narrow intrinsics read a double-width
input vector and, depending on whether you pick the bottom or top variant,
write to the even or odd elements of the single-width result vector. For
example, `vshrnbq_n_s16(a, b, 2)` will take each of the eight `int16_t`
elements of argument `b`, shift them right by two, narrow them to eight bits,
and write them to the even elements of the `int8x16_t` result vector, where
the odd elements are taken from the equally typed `int8x16_t` argument `a`.
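The bottom-variant interleaving can be modelled per lane in portable C; the function name and array representation are illustrative assumptions, not `<arm_mve.h>` definitions:

```c
#include <stdint.h>

/* Scalar model of vshrnbq_n_s16(a, b, n): shift each 16-bit lane of b
   right by n (>> models the arithmetic shift used on Arm targets),
   truncate to 8 bits, and write the results to the even (bottom)
   lanes of the result; the odd (top) lanes are taken unchanged
   from a. */
static void model_vshrnbq_n_s16(const int8_t a[16], const int16_t b[8],
                                unsigned n, int8_t r[16])
{
    for (int i = 0; i < 8; i++) {
        r[2 * i]     = (int8_t)(b[i] >> n); /* even lanes: narrowed b */
        r[2 * i + 1] = a[2 * i + 1];        /* odd lanes: kept from a */
    }
}
```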

*Predication*: the M-profile Vector Extension uses vector predication to allow
SIMD operations on selected lanes. The MVE intrinsics expose vector predication
by providing predicated intrinsic variants for instructions that support it.
These intrinsics can be recognized by one of the four suffixes:
* `_m` (merging), which indicates that false-predicated lanes are not written
to and keep the same value as they had in the first argument of the intrinsic.
* `_p` (predicated), which indicates that false-predicated lanes are not used
in the SIMD operation. For example, in `vaddvq_p_s8` the false-predicated
lanes are not added to the resulting sum.
* `_z` (zero), which indicates that false-predicated lanes are filled with
zeroes. These are only used for load instructions.
* `_x` (don't-care), which indicates that the false-predicated lanes have
undefined values. These are syntactic sugar for merge intrinsics with a
`vuninitializedq` inactive parameter.

These predicated intrinsics can also be recognized by their last parameter
being of type `mve_pred16_t`, which is an alias for the `uint16_t` type.
Some predicated intrinsics may have a dedicated first parameter to specify the
value in the result vector for the false-predicated lanes; this argument will
be of the same type as the result type. For example,
`v = veorq_m_s8(inactive, a, b, p)` will write to each of the sixteen lanes
of the result vector `v` either the result of the exclusive or between the
corresponding lanes of vectors `a` and `b`, or the corresponding lane of
vector `inactive`, depending on whether that lane is true- or false-predicated
in `p`. The types of `inactive`, `a`, `b` and `v` are all `int8x16_t` in this
case, and `p` has type `mve_pred16_t`.
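As a rough illustration of merging predication, this portable C sketch models `veorq_m_s8` with one predicate bit per byte lane; the function name and the use of plain arrays in place of the vector types are illustrative assumptions:

```c
#include <stdint.h>

/* Scalar model of veorq_m_s8: lane i of the result is a[i] ^ b[i]
   when bit i of the 16-bit predicate is set, and inactive[i]
   otherwise (one predicate bit per byte lane for 8-bit elements). */
static void model_veorq_m_s8(const int8_t inactive[16],
                             const int8_t a[16], const int8_t b[16],
                             uint16_t p, int8_t r[16])
{
    for (int i = 0; i < 16; i++)
        r[i] = ((p >> i) & 1) ? (int8_t)(a[i] ^ b[i]) : inactive[i];
}
```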

For the MVE ACLE intrinsics, passing a mask that does not have all bits set to the same value per input-width sized lane to a predicated intrinsic is considered undefined behavior. That is, the user should make sure the producer of the mask uses an element size equal to or higher than the element size of the input vector parameters, other than the ‘inactive’ parameter, of the predicated intrinsic consuming the mask. For example:

```
mve_pred16_t mask8 = vcmpeqq_u8 (a, b); // One predicate bit per byte.
uint8x16_t r8 = vaddq_x_u8 (a, b, mask8); // This is OK.
uint16x8_t r16 = vaddq_x_u16 (c, d, mask8); // This is UNDEFINED BEHAVIOR.

mve_pred16_t mask = 0x5555; // Predicate every other byte.
uint8x16_t r8_2 = vaddq_x_u8 (a, b, mask); // This is OK.
uint16x8_t r16_2 = vaddq_x_u16 (c, d, mask); // This is UNDEFINED BEHAVIOR.
```

Users wishing to exploit this predication behavior are encouraged to use inline assembly.

## Scalar shift intrinsics

The M-profile Vector Extension (MVE) also provides a set of scalar shift instructions that operate on signed and unsigned double-words and single-words. These shifts can perform additional saturation, rounding, or both. The ACLE for MVE defines intrinsics for these instructions.
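To illustrate what rounding and saturation mean for these shifts, here is a hedged scalar sketch in portable C. The function names are illustrative models, not ACLE intrinsic names:

```c
#include <stdint.h>

/* Model of a rounding arithmetic right shift: add half of the
   rounding increment (1 << (n-1)) before shifting, for 1 <= n <= 31. */
static int32_t model_rshr32(int32_t x, unsigned n)
{
    int64_t v = (int64_t)x + ((int64_t)1 << (n - 1));
    return (int32_t)(v >> n);
}

/* Model of a saturating left shift: clamp to INT32_MIN/INT32_MAX
   instead of letting bits overflow, for n < 32. */
static int32_t model_sqshl32(int32_t x, unsigned n)
{
    int64_t v = (int64_t)x << n;
    if (v > INT32_MAX) return INT32_MAX;
    if (v < INT32_MIN) return INT32_MIN;
    return (int32_t)v;
}
```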

## Namespace

By default all M-profile Vector Extension intrinsics are available both with
and without the `__arm_` prefix. If the `__ARM_MVE_PRESERVE_USER_NAMESPACE`
macro is defined, the `__arm_` prefix is mandatory. This is available to hide
the user-namespace-polluting variants of the intrinsics.

## Intrinsic polymorphism

The ACLE for the M-profile Vector Extension intrinsics was designed in such a
way that it supports a polymorphic implementation of most intrinsics. The
polymorphic name of an intrinsic is indicated by leaving out the type suffix
enclosed in square brackets; for example, the vector addition intrinsic
`vaddq[_s32]` can be called using the function name `vaddq`. Note that the
polymorphism is only possible on input parameter types, and intrinsics with
the same name must still have the same number of parameters. This is expected
to aid implementation of the polymorphism using C11's `_Generic` selection.
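A simplified sketch of how such a polymorphic name can dispatch on argument type with C11 `_Generic` is shown below. The struct vector types and add functions are stand-ins invented for illustration, not the real `<arm_mve.h>` definitions:

```c
#include <stdint.h>

typedef struct { int32_t  v[4]; } my_int32x4_t;
typedef struct { uint32_t v[4]; } my_uint32x4_t;

static my_int32x4_t my_vaddq_s32(my_int32x4_t a, my_int32x4_t b)
{
    for (int i = 0; i < 4; i++) a.v[i] += b.v[i];
    return a;
}

static my_uint32x4_t my_vaddq_u32(my_uint32x4_t a, my_uint32x4_t b)
{
    for (int i = 0; i < 4; i++) a.v[i] += b.v[i];
    return a;
}

/* One polymorphic name, resolved at compile time from the type of
   the first argument. */
#define my_vaddq(a, b) _Generic((a),             \
    my_int32x4_t:  my_vaddq_s32,                 \
    my_uint32x4_t: my_vaddq_u32)(a, b)
```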

## Vector data types

Vector data types are named as a lane type and a multiple. Lane type
names are based on the types defined in `<stdint.h>`. For example,
`int16x8_t` is a vector of eight `int16_t` values. The base types are
`int8_t`, `uint8_t`, `int16_t`, `uint16_t`, `int32_t`, `uint32_t`,
`int64_t`, `uint64_t`, `float16_t` and `float32_t`.
The multiples are such that the resulting vector types are 128 bits wide.

## Vector array data types

Array types are defined for multiples of 2 and 4 of all the vector types, for
use in load and store operations. For a vector type `<type>_t` the
corresponding array type is `<type>x<length>_t`. Concretely, an array type is
a structure containing a single array element called `val`.

For example, an array of two `int16x8_t` types is `int16x8x2_t`, and is
represented as:

```
struct int16x8x2_t { int16x8_t val[2]; };
```

## Scalar data types

For consistency, `<arm_mve.h>` defines some additional scalar data types
to match the vector types.

`float32_t` is defined as an alias for `float`, `float16_t` is defined as
an alias for `__fp16` and `mve_pred16_t` is defined as an alias for
`uint16_t`.

## Operations on data types

ACLE does not define implicit conversion between different data types. E.g.

```
int32x4_t x;
uint32x4_t y = x; // No representation change
float32x4_t z = x; // Conversion of integer to floating type
```

is not portable. Use the `vreinterpretq` intrinsics to convert from one
vector type to another without changing representation, and use the `vcvtq`
intrinsics to convert between integer and floating-point types; for example:

```
int32x4_t x;
uint32x4_t y = vreinterpretq_u32_s32(x);
float32x4_t z = vcvtq_f32_s32(x);
```
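The difference between the two kinds of conversion can be modelled per lane in portable C. These helper names are illustrative; the real intrinsics operate on whole vectors:

```c
#include <stdint.h>
#include <string.h>

/* Model of reinterpretation: the bit pattern is kept unchanged
   (assumes IEEE 754 binary32 float, as on Arm). */
static uint32_t reinterpret_u32_f32(float x)
{
    uint32_t r;
    memcpy(&r, &x, sizeof r);
    return r;
}

/* Model of conversion: the value is converted to the new type. */
static float convert_f32_s32(int32_t x)
{
    return (float)x;
}
```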

ACLE does not define static construction of vector types. E.g.

```
int32x4_t x = { 1, 2, 3, 4 };
```

is not portable. Use the `vcreateq` or `vdupq` intrinsics to construct vector
values from scalars.

In C++, ACLE does not define whether MVE data types are POD types or whether they can be inherited from.

## Compatibility with other vector programming models

ACLE does not specify how the MVE Intrinsics interoperate with alternative vector programming models. Consequently, programmers should take particular care when combining the MVE programming model with such programming models.

For example, the GCC vector extensions permit initializing a variable using array syntax, like so:

```
#include "arm_mve.h"
...
uint32x4_t x = {0, 1, 2, 3}; // GCC extension.
uint32_t y = vgetq_lane_u32 (x, 0); // ACLE MVE Intrinsic.
```

But the definition of the GCC vector extensions is such that the value
stored in `y` will depend on whether the program is running in big- or
little-endian mode.

It is recommended that MVE Intrinsics be used consistently:

```
#include "arm_mve.h"
...
const uint32_t temp[4] = {0, 1, 2, 3};
uint32x4_t x = vld1q_u32 (temp);
uint32_t y = vgetq_lane_u32 (x, 0);
```

## Availability of M-profile Vector Extension intrinsics

M-profile Vector Extension support is available if the `__ARM_FEATURE_MVE`
macro has a value other than 0 (see M-profile Vector Extension). The
availability of the MVE floating-point data types and intrinsics is predicated
on this macro having its second bit set (that is, `__ARM_FEATURE_MVE & 2`).
In order to access the MVE intrinsics, it is necessary to include the
`<arm_mve.h>` header.

```
#if (__ARM_FEATURE_MVE & 3) == 3
#include <arm_mve.h>
/* MVE integer and floating point intrinsics are now available to use. */
#elif __ARM_FEATURE_MVE & 1
#include <arm_mve.h>
/* MVE integer intrinsics are now available to use. */
#endif
```

### Specification of M-profile Vector Extension intrinsics

The M-profile Vector Extension intrinsics are specified in the Arm MVE Intrinsics Reference Architecture Specification [MVE].

The behavior of an intrinsic is specified to be equivalent to the MVE instruction it is mapped to in [MVE]. Intrinsics are specified as a mapping between their name, arguments and return values and the MVE instruction and assembler operands which they are equivalent to.

A compiler may make use of the as-if rule from C [C99] (5.1.2.3) to perform optimizations which preserve the instruction semantics.