Helium intrinsics
Intrinsics are functions whose precise implementation is known to a compiler. The Helium intrinsics are a set of C and C++ functions that are defined in the header file arm_mve.h. The Arm compilers and GCC support these intrinsics.
Helium intrinsics provide direct access to Helium instructions from C and C++ code without having to write assembly code by hand. The intrinsics map to short assembly kernels which are inlined into the calling code. Also, the compiler handles register allocation and pipeline optimization. This means that many difficulties that are faced by the assembly programmer are avoided.
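For example, an intrinsic call often corresponds to a single Helium instruction. The following minimal sketch (the function name is illustrative, not part of the ACLE specification) adds two vectors of four 32-bit floating-point values; the compiler chooses the vector registers and emits the corresponding VADD.F32 instruction:

```
#include <arm_mve.h>

// Add two vectors of four float32 values using a single Helium intrinsic.
// vaddq_f32 maps to a VADD.F32 vector instruction; register allocation and
// scheduling are handled by the compiler.
float32x4_t add_four_floats(float32x4_t a, float32x4_t b)
{
    return vaddq_f32(a, b);
}
```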
See the Arm MVE Intrinsics Reference Architecture specification (also available as interactive HTML) for a list of all the Helium intrinsics. This specification forms part of the Arm C Language Extensions (ACLE).
Using the Helium intrinsics has several benefits:
- Powerful: Intrinsics give the programmer direct access to the Helium instruction set without the need for hand-written assembly code.
- Portable: Hand-written Helium assembly instructions might need to be rewritten for different target processors. C and C++ code containing Helium intrinsics can be compiled for a new target with minimal or no code changes.
- Flexible: The programmer can exploit Helium when needed, or use C/C++ otherwise, while avoiding many low-level engineering concerns.
However, intrinsics might not be the right choice in all situations:
- It is more difficult to use Helium intrinsics than to import a library or rely on a compiler.
- Hand-optimized assembly code might offer the greatest scope for performance improvement even if it is more difficult to write.
Helium header file
The header file arm_mve.h defines the Helium intrinsics. You must include this header file in every source file that uses Helium intrinsics.
You should test the __ARM_FEATURE_MVE macro before including the header. The __ARM_FEATURE_MVE macro is a 2-bit bitmap indicating M-profile Vector Extension (MVE) support:
- Bit 0 indicates whether Helium integer instructions are available.
- Bit 1 indicates whether Helium floating-point instructions are available.
The valid values of __ARM_FEATURE_MVE are therefore:
- 0 indicates that Helium is not available.
- 1 indicates that only the Helium integer intrinsics are available.
- 3 indicates that both the Helium integer and floating-point intrinsics are available.
The __ARM_FEATURE_MVE macro should be tested to check that Helium is enabled on the target platform before including the header:
```
#if (__ARM_FEATURE_MVE & 3) == 3
#include <arm_mve.h>
// MVE integer and floating-point intrinsics are now available to use.
#elif __ARM_FEATURE_MVE & 1
#include <arm_mve.h>
// MVE integer intrinsics are now available to use.
#endif
```
Namespaces
By default, Helium intrinsics occupy both the user namespace and the __arm_ namespace. That is, both these lines of code are equivalent:
```
vecDst = vmulq_f32(vecA, vecB);
vecDst = __arm_vmulq_f32(vecA, vecB);
```
Defining the macro __ARM_MVE_PRESERVE_USER_NAMESPACE before including arm_mve.h hides the definition of the user namespace variants:
```
#define __ARM_MVE_PRESERVE_USER_NAMESPACE
#include <arm_mve.h>

vecDst = vmulq_f32(vecA, vecB);       // Invalid. User namespace variants are hidden.
vecDst = __arm_vmulq_f32(vecA, vecB); // Valid.
```
Compiling code containing Helium intrinsics with Arm Compiler 6
To compile code containing Helium intrinsics, you must do the following:
- Include the Helium intrinsics header file arm_mve.h in your code.
- Specify compiler options that identify a target with Helium capabilities.
The preceding steps are the minimum that you must do to enable Helium intrinsics to be compiled into Helium instructions. However, you might also want to have the compiler perform auto-vectorization, so that it can identify further opportunities in your code to improve performance with Helium. In this case, specify an appropriate optimization level to enable auto-vectorization.
To target Helium for any Helium-enabled Armv8-M platform:
armclang -target arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp ...
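For example, with Arm Compiler 6, auto-vectorization is enabled by default at optimization level -O2 and higher, so a command like the following (the source file name here is illustrative) compiles intrinsics and also lets the compiler auto-vectorize the surrounding code:

armclang -target arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp -O2 -c my_source.c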
Helium intrinsics example
This example shows how you can use Helium intrinsics to perform vector multiplication.
Vector multiplication multiplies the value of the elements in the first source vector by the respective elements in the second source vector, then writes the result to a destination vector register. That is:
[Ai, Aj, Ak, ...] x [Bi, Bj, Bk, ...] = [(Ai x Bi), (Aj x Bj), (Ak x Bk), ...]
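For example, with four 32-bit lanes in each vector:

[1.0, 2.0, 3.0, 4.0] x [5.0, 6.0, 7.0, 8.0] = [5.0, 12.0, 21.0, 32.0]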
The main() function does the following:
- Creates two source data arrays, each containing eight floating-point numbers.
- Calls the my_mult_f32_intr() function, passing the following arguments:
  - Pointers to the two input arrays to use as the data source
  - The block size in memory of the input arrays
  - A pointer to the A_src array, to use as the result destination. The result therefore overwrites the original data.
The my_mult_f32_intr() function does the following:
- Uses the block size to calculate how many vector loop iterations are required. Because we are dealing with 32-bit floating-point values, and the Helium registers are 128 bits wide, we can operate on four data values in each iteration.
- Loads data from the input arrays into the Helium vector registers, four values at a time
- Performs the vector multiplication on the input vectors
- Stores the result vector into the destination array
- Advances the array pointers by the size of four data elements
- Decrements the loop counter, and loops around until all loop iterations have finished
The my_mult_f32_intr() function is implemented as follows:
```
#include <stdio.h>
#include <arm_mve.h>

void my_mult_f32_intr(
    float32_t * pSrcA,
    float32_t * pSrcB,
    float32_t * pDst,
    uint32_t blockSize)
{
    // Calculate memory block size for 4 x lanes of float32_t data
    const int blkSize_F32 = 4 * sizeof(float32_t);

    // Calculate how many loop iterations are required:
    // size of array / size of 4 data items
    int blkCnt = blockSize / blkSize_F32;

    // Create source and destination vectors, configured for 4 lanes of float32_t data
    float32x4_t vecA, vecB, vecDst;

    // Main loop
    while (blkCnt > 0)
    {
        // Load source vectors with data from the input arrays
        vecA = vldrwq_f32(pSrcA);
        vecB = vldrwq_f32(pSrcB);

        // Perform vector multiplication
        vecDst = vmulq_f32(vecA, vecB);

        // Store the result vector into the destination array
        vstrwq_f32(pDst, vecDst);

        // Decrement the loop count
        blkCnt--;

        // Advance source and destination pointers by 4 data elements.
        // Pointer arithmetic on float32_t* is in elements, not bytes.
        pSrcA += 4;
        pSrcB += 4;
        pDst += 4;
    }
}

int main()
{
    // Setup data in input arrays
    float32_t A_src[] = {1.1, 7.9, 8.2, 2.1, 5.3, 2.2, 3.1, 6.9};
    float32_t B_src[] = {7.2, 2.7, 9.9, 8.2, 1.3, 1.1, 6.9, 2.4};

    // Call the multiplication function
    my_mult_f32_intr(&A_src[0], &B_src[0], &A_src[0], sizeof(A_src));

    return 0;
}
```
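For comparison, the same operation written without intrinsics is a plain scalar loop. The following sketch (the function name and parameters are illustrative, and not part of the example above) multiplies one pair of elements per iteration instead of four, and can be used as a reference to check the results produced by my_mult_f32_intr():

```
#include <arm_mve.h>   // for the float32_t type, as used in the example above

// Scalar reference version: one multiply per loop iteration instead of four.
// numElements is a count of float32_t values, not a size in bytes.
void my_mult_f32_scalar(const float32_t * pSrcA, const float32_t * pSrcB,
                        float32_t * pDst, uint32_t numElements)
{
    for (uint32_t i = 0; i < numElements; i++)
    {
        pDst[i] = pSrcA[i] * pSrcB[i];
    }
}
```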
The following table shows some additional information about the intrinsics that are used:
Intrinsic | Description
---|---
vldrwq_f32 | Loads consecutive elements from memory into a destination vector register.
vmulq_f32 | Multiplies the value of the elements in the first source vector register by the respective elements in the second source vector register. The result is then written to the destination vector register.
vstrwq_f32 | Stores consecutive elements to memory from a vector register.
You can compile this code with Arm Compiler 6 as follows:
armclang -target arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp -Ofast -S my_mult_f32_intr.c
In this example, using the -S option means that the compiler writes the generated assembly code to the file my_mult_f32_intr.s.
Examining the my_mult_f32_intr.s file shows the instructions that the compiler has generated; the inner loop looks similar to the following:
```
...
.LBB0_1:                          @ =>This Inner Loop Header: Depth=1
        vldrw.u32   q0, [r1]
        vldrw.u32   q1, [r0]
        adds        r1, #16
        adds        r0, #16
        vmul.f32    q0, q1, q0
        vstrw.32    q0, [r2]
        adds        r2, #16
        le          lr, .LBB0_1
...
```
Here we can see that:
- The vldrwq_f32 intrinsics compile to vldrw.u32 instructions.
- The vmulq_f32 intrinsic compiles to a vmul.f32 instruction.
- The vstrwq_f32 intrinsic compiles to a vstrw.32 instruction.