Skip to Main Content Skip to Footer Navigation

Sorry, your browser is not supported. We recommend upgrading your browser. We have done our best to make all the documentation and resources available on old versions of Internet Explorer, but vector image support and the layout may not be optimal. Technical documentation is available as a PDF Download.

You copied the Doc URL to your clipboard.

M-profile Vector Extension (MVE) intrinsics

The M-profile Vector Extension (MVE) [MVE-spec] instructions provide packed Single Instruction Multiple Data (SIMD) and single-element scalar operations on a range of integer and floating-point types. MVE can also be referred to as Helium.

The M-profile Vector Extension provides for arithmetic, logical and saturated arithmetic operations on 8-bit, 16-bit and 32-bit integers (and sometimes on 64-bit integers) and on 16-bit and 32-bit floating-point data, arranged in 128-bit vectors.

The intrinsics in this section provide C and C++ programmers with a simple programming model allowing easy access to the code generation of the MVE instructions for the Armv8.1-M Mainline architecture.

Concepts

The MVE instructions are designed to improve the performance of SIMD operations by operating on 128-bit vectors of elements of the same scalar data type.

For example, uint16x8_t is a 128-bit vector type consisting of eight elements of the scalar uint16_t data type. Likewise, uint8x16_t is a 128-bit vector type consisting of sixteen uint8_t elements.

In a vector programming model, operations are performed in parallel across the elements of the vector. For example, vmulq_u16(a, b) is a vector intrinsic which takes two uint16x8_t vector arguments a and b, and returns the result of multiplying corresponding elements from each vector together.

The M-profile Vector Extension also provides support for vector-by-scalar operations. In these operations, a scalar value is provided directly, duplicated to create a new vector with the same number of elements as an input vector, and an operation is performed in parallel between this new vector and other input vectors.

For example, vaddq_n_u16(a, s), is a vector-by-scalar intrinsic which takes one uint16x8_t vector argument and one uint16_t scalar argument. A new vector is formed which consists of eight copies of s, and this new vector is multiplied by a.

Reductions work across the whole of a single vector performing the same operation between elements of that vector. For example, vaddvq_u16(a) is a reduction intrinsic which takes a uint16x8_t vector, adds each of the eight uint16_t elements together, and returns a uint32_t result containing the sum. Note the difference in return types between MVE’s vaddvq_u16 and Advanced SIMD’s implementation of the same name intrinsic, MVE returns the uint32_t type whereas Advanced SIMD returns the element type uint16_t.

Cross-lane and pairwise vector operations work on pairs of elements within a vector, sometimes performing the same operation like in the case of the vector saturating doubling multiply subtract dual returning high half with exchange vqdmlsdhxq_s8 or sometimes a different one as is the case with the vector complex addition intrinsic vcaddq_rot90_s8.

Some intrinsics may only read part of the input vectors whereas others may only write part of the results. For example, the vector multiply long intrinsics, depending on whether you use vmullbq_int_s32 or vmulltq_int_s32, will read the even (bottom) or odd (top) elements of each int16x8_t input vectors, multiply them and write to a double-width int32x4_t vector. In contrast the vector shift right and narrow will read in a double-width input vector and, depending on whether you pick the bottom or top variant, write to the even or odd elements of the single-width result vector. For example, vshrnbq_n_s16(a, b, 2) will take each eight elements of type int16_t of argument b, shift them right by two, narrow them to eight bits and write them to the even elements of the int8x16_t result vector, where the odd elements are picked from the equally typed int8x16_t argument a.

Predication: the M-profile Vector Extension uses vector predication to allow SIMD operations on selected lanes. The MVE intrinsics expose vector predication by providing predicated intrinsic variants for instructions that support it. These intrinsics can be recognized by one of the four suffixes: * _m (merging) which indicates that false-predicated lanes are not written to and keep the same value as they had in the first argument of the intrinsic. * _p (predicated) which indicates that false-predicated lanes are not used in the SIMD operation. For example vaddvq_p_s8, where the false-predicated lanes are not added to the resulting sum. * _z (zero) which indicates that false-predicated lanes are filled with zeroes. These are only used for load instructions. * _x (dont-care) which indicates that the false-predicated lanes have undefined values. These are syntactic sugar for merge intrinsics with a vuninitializedq inactive parameter.

These predicated intrinsics can also be recognized by their last parameter being of type mve_pred16_t. This is an alias for the uint16_t type. Some predicated intrinsics may have a dedicated first parameter to specify the value in the result vector for the false-predicated lanes; this argument will be of the same type as the result type. For example, v = veorq_m_s8(inactive, a, b, p), will write to each of the sixteen lanes of the result vector v, either the result of the exclusive or between the corresponding lanes of vectors a and b, or the corresponding lane of vector inactive, depending on whether that lane is true- or false-predicated in p. The types of inactive, a, b and v are all int8x16_t in this case and p has type mve_pred16_t.

For the MVE ACLE intrinsics, passing a mask that does not have all bits set to the same value per input-width sized lane to a predicated intrinsic is considered undefined behavior. That is, the user should make sure the producer of the mask uses an element size equal to or higher than the element size of the input vector parameters, other than the ‘inactive’ parameter, of the predicated intrinsic consuming the mask. For example:

mve_pred16_t mask8 = vcmpeqq_u8 (a, b);
uint8x16_t r8 = vaddq_u8 (a,b);         // This is OK.
uint16x8_t r16 = vaddq_u16 (c, d);      // This is UNDEFINED BEHAVIOR.
mve_pred16_t mask8 = 0x5555;            // Predicate every other byte.
uint8x16_t r8 = vaddq_u8 (a,b);         // This is OK.
uint16x8_t r16 = vaddq_u16 (c, d);      // This is UNDEFINED BEHAVIOR.

Users wishing to exploit this predication behavior are encouraged to use inline assembly.

Scalar shift intrinsics

The M-profile Vector Extension (MVE) also provides a set of scalar shift instructions that operate on signed and unsigned double-words and single-words. These shifts can perform additional saturation, rounding, or both. The ACLE for MVE defines intrinsics for these instructions.

Namespace

By default all M-profile Vector Extension intrinsics are available with and without the __arm_ prefix. If the __ARM_MVE_PRESERVE_USER_NAMESPACE macro is defined, the __arm_ prefix is mandatory. This is available to hide the user-namespace-polluting variants of the intrinsics.

Intrinsic polymorphism

The ACLE for the M-profile Vector Extension intrinsics was designed in such a way that it supports a polymorphic implementation of most intrinsics. The polymorphic name of an intrinsic is indicated by leaving out the type suffix enclosed in square brackets, for example the vector addition intrinsic vaddq[_s32] can be called using the function name vaddq. Note that the polymorphism is only possible on input parameter types and intrinsics with the same name must still have the same number of parameters. This is expected to aid implementation of the polymorphism using C11’s _Generic selection.

Vector data types

Vector data types are named as a lane type and a multiple. Lane type names are based on the types defined in <stdint.h>. For example,. int16x8_t is a vector of eight int16_t values. The base types are int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, float16_t and float32_t. The multiples are such that the resulting vector types are 128-bit.

Vector array data types

Array types are defined for multiples of 2 and 4 of all the vector types, for use in load and store operations. For a vector type <type>_t the corresponding array type is <type>x<length>_t. Concretely, an array type is a structure containing a single array element called val.

For example, an array of two int16x8_t types is int16x4x8_t, and is represented as:

struct int16x8x2_t { int16x8_t val[2]; };

Scalar data types

For consistency, <arm_mve.h> defines some additional scalar data types to match the vector types.

float32_t is defined as an alias for float, float16_t is defined as an alias for __fp16 and mve_pred16_t is defined as an alias for uint16_t.

Operations on data types

ACLE does not define implicit conversion between different data types. E.g.

int32x4_t x;
uint32x4_t y = x; // No representation change
float32x4_t z = x; // Conversion of integer to floating type

Is not portable. Use the vreinterpretq intrinsics to convert from one vector type to another without changing representation, and use the vcvtq intrinsics to convert between integer and floating types; for example:

int32x4_t x;
uint32x4_t y = vreinterpretq_u32_s32(x);
float32x4_t z = vcvtq_f32_s32(x);

ACLE does not define static construction of vector types. E.g.

int32x4_t x = { 1, 2, 3, 4 };

Is not portable. Use the vcreateq or vdupq intrinsics to construct values from scalars.

In C++, ACLE does not define whether MVE data types are POD types or whether they can be inherited from.

Compatibility with other vector programming models

ACLE does not specify how the MVE Intrinsics interoperate with alternative vector programming models. Consequently, programmers should take particular care when combining the MVE programming model with such programming models.

For example, the GCC vector extensions permit initialising a variable using array syntax, as so

#include "arm_mve.h"
...
uint32x4_t x = {0, 1, 2, 3}; // GCC extension.
uint32_t y = vgetq_lane_s32 (x, 0); // ACLE MVE Intrinsic.

But the definition of the GCC vector extensions is such that the value stored in y will depend on whether the program is running in big- or little-endian mode.

It is recommended that MVE Intrinsics be used consistently:

#include "arm_mve.h"
...
const int temp[4] = {0, 1, 2, 3};
uint32x4_t x = vld1q_s32 (temp);
uint32_t y = vgetq_lane_s32 (x, 0);

Availability of M-profile Vector Extension intrinsics

M-profile Vector Extension support is available if the __ARM_FEATURE_MVE macro has a value other than 0 (see M-profile Vector Extension). The availability of the MVE Floating Point data types and intrinsics are predicated on the value of this macro having bit two set. In order to access the MVE intrinsics, it is necessary to include the <arm_mve.h> header.

#if (__ARM_FEATURE_MVE & 3) == 3
#include <arm_mve.h>
  /* MVE integer and floating point intrinsics are now available to use.  */
#elif __ARM_FEATURE_MVE & 1
#include <arm_mve.h>
  /* MVE integer intrinsics are now available to use.  */
#endif

Specification of M-profile Vector Extension intrinsics

The M-profile Vector Extension intrinsics are specified in the Arm MVE Intrinsics Reference Architecture Specification [MVE].

The behavior of an intrinsic is specified to be equivalent to the MVE instruction it is mapped to in [MVE]. Intrinsics are specified as a mapping between their name, arguments and return values and the MVE instruction and assembler operands which they are equivalent to.

A compiler may make use of the as-if rule from C [C99] (5.1.2.3) to perform optimizations which preserve the instruction semantics.

Undefined behavior

Care should be taken by compiler implementers not to introduce the concept of undefined behavior to the semantics of an intrinsic.

Alignment assertions

The MVE load and store instructions provide for alignment assertions, which may speed up access to aligned data (and will fault access to unaligned data). The MVE intrinsics do not directly provide a means for asserting alignment.

Was this page helpful? Yes No