Helium intrinsics

Intrinsics are functions whose precise implementation is known to a compiler. The Helium intrinsics are a set of C and C++ functions that are defined in the header file arm_mve.h. The Arm compilers and GCC support these intrinsics.

Helium intrinsics provide direct access to Helium instructions from C and C++ code without having to write assembly code by hand. The intrinsics map to short assembly kernels which are inlined into the calling code. Also, the compiler handles register allocation and pipeline optimization. This means that many difficulties that are faced by the assembly programmer are avoided.

See the Arm MVE Intrinsics Reference Architecture specification (also available as interactive HTML) for a list of all the Helium intrinsics. This specification forms part of the Arm C Language Extensions (ACLE).

Using the Helium intrinsics has several benefits:

  • Powerful: Intrinsics give the programmer direct access to the Helium instruction set without the need for hand-written assembly code.
  • Portable: Hand-written Helium assembly instructions might need to be rewritten for different target processors. C and C++ code containing Helium intrinsics can be compiled for a new target with minimal or no code changes.
  • Flexible: The programmer can exploit Helium when needed, or use C/C++ otherwise, while avoiding many low-level engineering concerns.

However, intrinsics might not be the right choice in all situations:

  • Using Helium intrinsics takes more effort than calling an optimized library or relying on compiler auto-vectorization.
  • Hand-optimized assembly code might offer the greatest scope for performance improvement even if it is more difficult to write.

Helium header file

You should test the __ARM_FEATURE_MVE macro before including the header. The __ARM_FEATURE_MVE macro is a 2-bit bitmap indicating M-profile Vector Extension (MVE) support:

  • Bit 0 indicates whether Helium integer instructions are available.
  • Bit 1 indicates whether Helium floating-point instructions are available.

The valid values of __ARM_FEATURE_MVE are therefore:

  • 0 indicates that Helium is not available.
  • 1 indicates that only the Helium integer intrinsics are available.
  • 3 indicates that both the Helium integer and floating-point intrinsics are available.

The following code tests the macro and includes the header only when Helium is enabled on the target platform:

#if (__ARM_FEATURE_MVE & 3) == 3
#include <arm_mve.h>
     // MVE integer and floating-point intrinsics are now available to use.
#elif __ARM_FEATURE_MVE & 1
#include <arm_mve.h>
     // MVE integer intrinsics are now available to use.
#endif
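
As a rough illustration of how the two bits can be decoded, the following sketch wraps the macro test in two helper functions. The names HELIUM_FEATURE_BITS, helium_has_integer(), and helium_has_float() are hypothetical and are not part of arm_mve.h. The sketch also assumes that an undefined __ARM_FEATURE_MVE should be treated as 0, which lets the same file build for targets without Helium.

```c
#include <stdbool.h>

// Treat a missing macro as "no Helium" so that the file also builds for
// targets without MVE. HELIUM_FEATURE_BITS is a hypothetical name.
#if defined(__ARM_FEATURE_MVE)
#define HELIUM_FEATURE_BITS (__ARM_FEATURE_MVE)
#else
#define HELIUM_FEATURE_BITS 0
#endif

#if HELIUM_FEATURE_BITS & 1
#include <arm_mve.h>   // Only include the header when Helium is enabled
#endif

// Bit 0: Helium integer instructions are available.
static inline bool helium_has_integer(void) {
  return (HELIUM_FEATURE_BITS & 1) != 0;
}

// Bit 1: Helium floating-point instructions are available.
static inline bool helium_has_float(void) {
  return (HELIUM_FEATURE_BITS & 2) != 0;
}
```

Because the valid values are 0, 1, and 3, helium_has_float() should never return true unless helium_has_integer() does as well.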

Namespaces

By default, Helium intrinsics occupy both the user namespace and the __arm_ namespace.

That is, both these lines of code are equivalent:

vecDst = vmulq_f32(vecA, vecB);
vecDst = __arm_vmulq_f32(vecA, vecB);

Defining the macro __ARM_MVE_PRESERVE_USER_NAMESPACE before including arm_mve.h hides the definition of the user namespace variants:

#define __ARM_MVE_PRESERVE_USER_NAMESPACE
#include <arm_mve.h>
vecDst = vmulq_f32(vecA, vecB);           // Invalid. User namespace variants are hidden.
vecDst = __arm_vmulq_f32(vecA, vecB);     // Valid.

Compiling code containing Helium intrinsics with Arm Compiler 6

To compile code containing Helium intrinsics, you must do the following:

  • Include the arm_mve.h header file in your source code.
  • Specify a Helium-enabled target on the armclang command line, using the -target and -march options shown in the command that follows.

The preceding steps are the minimum that you must do to enable Helium intrinsics to be compiled into Helium instructions. However, you might also want the compiler to perform auto-vectorization, so that it can find further opportunities in your code to improve performance with Helium. In this case, specify an appropriate optimization level to enable auto-vectorization.

To target Helium for any Helium-enabled Armv8.1-M platform:

armclang -target arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp ...

Helium intrinsics example

This example shows how you can use Helium intrinsics to perform vector multiplication.

Vector multiplication multiplies the value of the elements in the first source vector by the respective elements in the second source vector, then writes the result to a destination vector register. That is:

[Ai, Aj, Ak, ...] x [Bi, Bj, Bk, ...] = [(Ai x Bi), (Aj x Bj), (Ak x Bk), ...]
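
For comparison, the same element-wise operation written as plain scalar C might look like the sketch below. The function name scalar_mult_f32() and its element-count parameter are assumptions made for this illustration and do not appear in the Arm example. A routine like this can serve as a reference when checking the output of a Helium version.

```c
#include <stddef.h>

// Element-wise multiply: dst[i] = a[i] * b[i] for i in [0, count).
// dst may alias a or b, because each element is read before it is written.
void scalar_mult_f32(const float *a, const float *b, float *dst, size_t count) {
  for (size_t i = 0; i < count; i++) {
    dst[i] = a[i] * b[i];
  }
}
```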

The main() function does the following:

  • Creates two source data arrays, each containing eight floating-point numbers
  • Calls the my_mult_f32_intr() function, passing the following arguments:
    • Pointers to the two input arrays to use as the data source
    • A pointer to the A_src array, to use as the result destination. The result therefore overwrites the original data.
    • The size in bytes of the input arrays

The my_mult_f32_intr() function does the following:

  • Uses the block size to calculate how many vector loop iterations are required. Because we are dealing with 32-bit floating-point values, and the Helium registers are 128 bits wide, we can operate on four data values in each iteration.
  • Loads data from the input arrays into the Helium vector registers, four values at a time
  • Performs the vector multiplication on the input vectors
  • Stores the result vector into the destination array
  • Advances the array pointers by the size of four data elements
  • Decrements the loop counter, and loops around until all loop iterations have finished
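
The iteration count in the first step is simple integer arithmetic, which the following sketch isolates so that it can be checked on any host. The names F32_VECTOR_BYTES and f32_vector_iterations() are hypothetical. The division truncates, so a block whose size is not a multiple of 16 bytes leaves its trailing elements unprocessed.

```c
#include <stddef.h>

// Bytes in one 128-bit Helium vector holding 4 lanes of 32-bit floats.
#define F32_VECTOR_BYTES (4 * sizeof(float))

// Number of full 4-lane vectors in a block of the given size in bytes.
// Integer division discards any partial vector at the end of the block.
size_t f32_vector_iterations(size_t block_bytes) {
  return block_bytes / F32_VECTOR_BYTES;
}
```

For the eight-element arrays in this example, the block size is 32 bytes, so two loop iterations are required.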

The my_mult_f32_intr() function is implemented as follows:

#include <stdio.h>
#include <arm_mve.h> 

void my_mult_f32_intr(
                float32_t * pSrcA, float32_t * pSrcB,
                float32_t * pDst, uint32_t blockSize) {

  // Number of float32_t lanes in one 128-bit Helium vector
  const uint32_t lanes_F32 = 4;

  // Calculate memory block size in bytes for 4 x lanes of float32_t data
  const uint32_t blkSize_F32 = lanes_F32 * sizeof(float32_t);

  // Calculate how many loop iterations are required:
  //    size of array in bytes / size of 4 data items in bytes
  uint32_t blkCnt = blockSize / blkSize_F32;

  // Create source and destination vectors, configured for 4 lanes of float32_t data
  float32x4_t vecA, vecB, vecDst;

  // Main loop
  while (blkCnt > 0U) {
    // Load source vectors with data from the input arrays
    vecA = vldrwq_f32(pSrcA);
    vecB = vldrwq_f32(pSrcB);

    // Perform vector multiplication
    vecDst = vmulq_f32(vecA, vecB);

    // Store the result vector into the destination array
    vstrwq_f32(pDst, vecDst);

    // Decrement the loop count
    blkCnt--;

    // Advance source and destination pointers by 4 data elements.
    // Pointer arithmetic on float32_t * counts elements, not bytes.
    pSrcA += lanes_F32;
    pSrcB += lanes_F32;
    pDst += lanes_F32;
  }
}

int main() {
  // Setup data in input arrays
  float32_t A_src[] = {1.1, 7.9, 8.2, 2.1, 5.3, 2.2, 3.1, 6.9};
  float32_t B_src[] = {7.2, 2.7, 9.9, 8.2, 1.3, 1.1, 6.9, 2.4};

  // Call the multiplication function
  my_mult_f32_intr(&A_src[0], &B_src[0], &A_src[0], sizeof(A_src));

  return 0;
}
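
The function above assumes that the block size is a whole number of 16-byte vectors. As a hedged sketch of one way to relax that assumption, the variant below takes an element count, processes full vectors with the same three intrinsics, and finishes any remaining elements with a scalar loop. The names mult_f32_any_length() and HAVE_HELIUM_FP are hypothetical, and the scalar fallback for builds without Helium floating-point support is an addition for illustration, not part of the Arm example.

```c
#include <stddef.h>

#if defined(__ARM_FEATURE_MVE) && (__ARM_FEATURE_MVE & 2)
#include <arm_mve.h>
#define HAVE_HELIUM_FP 1   // Hypothetical macro, local to this sketch
#else
#define HAVE_HELIUM_FP 0
#endif

// Computes pDst[i] = pSrcA[i] * pSrcB[i] for i in [0, count).
void mult_f32_any_length(const float *pSrcA, const float *pSrcB,
                         float *pDst, size_t count) {
  size_t i = 0;

#if HAVE_HELIUM_FP
  // Full vectors: 4 lanes of float32_t per 128-bit Helium register.
  for (; i + 4 <= count; i += 4) {
    float32x4_t vecA = vldrwq_f32(&pSrcA[i]);
    float32x4_t vecB = vldrwq_f32(&pSrcB[i]);
    vstrwq_f32(&pDst[i], vmulq_f32(vecA, vecB));
  }
#endif

  // Scalar loop: handles the tail on Helium targets, and every
  // element when Helium floating-point support is not available.
  for (; i < count; i++) {
    pDst[i] = pSrcA[i] * pSrcB[i];
  }
}
```

Indexing the arrays by element, rather than advancing byte-sized offsets, keeps the loop bounds and the pointer arithmetic in the same unit.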

The following table shows some additional information about the intrinsics that are used:

Intrinsic     Description
vldrwq_f32    Loads consecutive elements from memory into a destination vector register.
vmulq_f32     Multiplies the value of the elements in the first source vector register by the respective elements in the second source vector register. The result is then written to the destination vector register.
vstrwq_f32    Stores consecutive elements to memory from a vector register.

You can compile this code with Arm Compiler 6 as follows:

armclang -target arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp \
            -Ofast -S my_mult_f32_intr.c

In this example, using the -S option means that the compiler stops after code generation and writes the generated assembly code to the file my_mult_f32_intr.s.

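Examining the my_mult_f32_intr.s file shows the instructions that the compiler has generated for the main loop. The exact output depends on the compiler version and options. In particular, the immediate in each adds instruction is the byte stride of the corresponding pointer update in the C source: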

	  ...
.LBB0_1:                                @ =>This Inner Loop Header: Depth=1
        vldrw.u32       q0, [r1]
        vldrw.u32       q1, [r0]
        adds            r1, #64
        adds            r0, #64
        vmul.f32        q0, q1, q0
        vstrw.32        q0, [r2]
        adds            r2, #64
        le              lr, .LBB0_1
	  ...

Here we can see that:

  • The vldrwq_f32 intrinsics compile to vldrw.u32 instructions.
  • The vmulq_f32 intrinsic compiles to a vmul.f32 instruction.
  • The vstrwq_f32 intrinsic compiles to a vstrw.32 instruction.