Overview

This guide provides information and examples for software programmers who want to use Arm Helium technology. We will discuss the benefits and drawbacks of the different approaches available, and examine real-world code examples to help you understand the key issues.

Arm Helium technology is the M-Profile Vector Extension (MVE) for the Arm Cortex-M processor series. Helium is an extension of the Armv8.1-M architecture and delivers a significant performance uplift for machine learning (ML) and digital signal processing (DSP) applications. For introductory information about Helium, please see the Introduction to Helium guide.

Before you begin

This guide forms part of the Helium Programmer’s Guide. If you are not familiar with Helium, you should start by reading the Introduction to Helium guide.

The sections of this guide contain code examples. These code examples are available to download as a ZIP file.

The examples in this guide use the Arm Compiler 6 toolchain, designed for embedded application development running on bare-metal devices. If you do not already have access to Arm Compiler 6, it is included in the 30-day free trial of Arm Development Studio Gold Edition.

Options for writing Helium-enabled code

Programming in any high-level language is a tradeoff between the ease of writing code and the amount of control that you have over the low-level instructions that the compiler outputs. The same applies when targeting Helium-enabled hardware. The goal is to ensure that Helium instructions are used wherever your code contains vectorization opportunities, that is, operations that could be performed in parallel.

At one end of the spectrum, you could write all your code in standard C/C++ and leave the implementation decisions to the compiler. If you are using an auto-vectorizing compiler, and your code is straightforward, this can produce excellent results. The compiler generates Helium instructions for all the vectorizable portions of your code.

The benefit of this approach is that it requires very little effort from the programmer, except for writing standard C/C++ code.

The drawback of this approach is that, if the compiler does not do what you want, for whatever reason, you might not have enough control to change that situation. For example, if your code is complex, the compiler might miss a vectorization opportunity and fail to use Helium. Modifying your code to follow best practices might be enough to help the compiler identify the vectorization opportunity, but you cannot be sure.

At the other end of the spectrum, you could write all your Helium code by hand in assembly. This gives you full control over the instructions used, but at the cost of vastly increased programmer effort.

The different options available for writing Helium-enabled code are:

  • Helium-enabled libraries
  • Auto-vectorization
  • Helium intrinsics
  • Assembly code

Helium-enabled libraries

Libraries that support Helium provide one of the easiest ways to take advantage of Helium.

Libraries provide a suite of functions that you can use in your own code. When you compile for a Helium-enabled target, a library variant using Helium instructions is selected. When you compile for a target that does not support Helium, a library variant using standard Arm instructions is selected. This means that the same source code can easily be compiled for both Helium-enabled targets and non-Helium-enabled targets.

Examples of Helium-enabled libraries include:

  • CMSIS-DSP – A suite of common signal processing functions for use on Cortex-M processor-based devices.
  • CMSIS-NN – A collection of efficient neural network kernels that are developed to maximize the performance, and minimize the memory footprint, of neural networks on Cortex-M processor cores.

Libraries are easy to incorporate into your code, and the implementations of the functions have already been optimized. For example, CMSIS-DSP has been designed to provide many of the functions that you would need to write signal-processing code like audio filters or Fast Fourier Transforms (FFTs).

The disadvantage of libraries is that you only have access to the functions that the library designer has provided.

Auto-vectorization

Auto-vectorization features in your compiler can automatically optimize your code to take advantage of Helium.

Auto-vectorization means allowing the compiler to automatically identify the areas of your code that would benefit from Single Instruction Multiple Data (SIMD) optimizations.

The benefit of using auto-vectorization is that the programmer leaves everything to the compiler.

The disadvantage of auto-vectorization is that, if the compiler does not do what you want, you might not have enough control to change that situation. For example, the compiler might fail to identify that a particular part of your code is vectorizable. You can use coding best practices to help the compiler identify that code is vectorizable, but they might not be enough to guide the compiler in the right direction. In these situations, you might need to use other options, for example intrinsics or inline assembly, to ensure that Helium instructions are used.

Helium intrinsics

Helium intrinsics are function calls that the compiler replaces with appropriate Helium instructions. Using Helium intrinsics gives you direct, low-level access to the exact Helium instructions that you want, all from C/C++ code.

The benefit of using intrinsics is that they provide almost as much control as writing assembly language, but leave details like register allocation to the compiler, so that developers can focus on the algorithms.

The disadvantage of using Helium intrinsics is that programming with intrinsics can be more complex than writing standard C/C++ code, and requires the programmer to learn about the available Helium intrinsics.

Assembly code

For very high performance, hand-coded Helium assembly code is an alternative approach for experienced programmers.

You can use pure assembly code modules (.s files) in your code, or you can use inline assembly code to embed assembler instructions in your C and C++ code.

The benefit of using assembly code is that it provides absolute control over the Helium instructions that are used.

The disadvantage of using assembly code is that writing assembly code can be a very complex process that most people would rather not have to do. Optimizing hand-written assembly code often requires detailed knowledge of the target hardware pipeline, especially for in-order Cortex-M processors. You might need to write and maintain different code variants for different targets to achieve optimal performance.

Enabling Helium

Helium is an optional extension to the Armv8.1-M architecture. This means that Helium may or may not be present on your target.

Because of this, you should check whether Helium is available on your target before running Helium code. This applies whether you are using a software model or a hardware target.

Enabling Helium on the ARM_AEMv8M Fixed Virtual Platform

FVP models provide a number of different parameters to configure the optional features of the model.

See the Fast Models Reference Manual for a complete list of all ARM_AEMv8M parameters.

To enable Helium on the ARM_AEMv8M FVP, set the following parameters:

  • cpu0.enable_helium_extension=1
  • cpu0.vfp-present=1
  • cpu0.vfp-enable_at_reset=1

Checking if Helium is present for hardware targets

The Media and VFP Feature Register 1 (MVFR1) describes the features that are provided by the floating-point extension. In particular, the MVFR1.MVE bitfield, bits [11:8], indicates support for the M-profile Vector Extension.

The possible values of this field are:

  • 0b0000 indicates that Helium instructions are not available.
  • 0b0001 indicates that Helium integer instructions are available, but Helium floating-point instructions are not available.
  • 0b0010 indicates that Helium integer and floating-point instructions are available.

For more information, see the Armv8-M Architecture Reference Manual.

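The __ARM_FEATURE_MVE macro provides another mechanism for checking whether Helium is present. This is a compile-time check: the macro reflects the target that the compiler is generating code for, rather than the hardware that the code eventually runs on.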
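
As a minimal sketch of the two checks described in this section, the following C code decodes the MVE field from a raw MVFR1 value and reports the compile-time __ARM_FEATURE_MVE setting. The type and function names are hypothetical. How you read MVFR1 on your device (for example, through a register definition in your device header) is an assumption that you should verify for your target.

```c
#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef enum {
  HELIUM_UNKNOWN = -1,   /* Any MVFR1.MVE encoding not listed in this guide */
  HELIUM_NONE = 0,       /* MVFR1.MVE == 0b0000 */
  HELIUM_INTEGER = 1,    /* MVFR1.MVE == 0b0001 */
  HELIUM_INTEGER_FP = 2  /* MVFR1.MVE == 0b0010 */
} helium_level_t;

/* Decode the MVE field, bits [11:8], of a raw MVFR1 register value. */
helium_level_t helium_level_from_mvfr1(uint32_t mvfr1) {
  uint32_t mve = (mvfr1 >> 8) & 0xFu;
  switch (mve) {
    case 0x0u: return HELIUM_NONE;
    case 0x1u: return HELIUM_INTEGER;
    case 0x2u: return HELIUM_INTEGER_FP;
    default:   return HELIUM_UNKNOWN;
  }
}

/* Compile-time view: the value of __ARM_FEATURE_MVE, or 0 if it is not defined. */
int helium_level_at_compile_time(void) {
#if defined(__ARM_FEATURE_MVE)
  return __ARM_FEATURE_MVE;
#else
  return 0;
#endif
}
```

Keeping the decode step separate from the register read means that the same function can be exercised with test values on a development host.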

Enabling Helium for hardware targets

For hardware targets, access control registers specify whether Helium instructions are available from privileged and unprivileged code.

  • The Non-secure Access Control Register (NSACR) specifies the Non-secure access permissions for Helium in bitfields CP10 and CP11.
    The possible values of these fields are:
    • 0 - Non-secure accesses to the Floating-point Extension or MVE, unless otherwise specified, generate a NOCP UsageFault.
    • 1 - Non-secure access to the Floating-point Extension or MVE is permitted.
  • The Coprocessor Access Control Register (CPACR) specifies the access privileges for Helium in bit field CP10.
    The possible values of this field are:
    • 0b00 - All accesses to the FP Extension and MVE result in NOCP UsageFault.
    • 0b01 - Unprivileged accesses to the FP Extension and MVE result in NOCP UsageFault.
    • 0b11 - Full access to the FP Extension and MVE.

For more information, see the Armv8-M Architecture Reference Manual.     
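
The following sketch shows how the register values described above could be computed. It only calculates the new values, and does not write to any register. The helper names are hypothetical. The bit positions (CPACR CP10 at bits [21:20] and CP11 at bits [23:22], NSACR CP10 at bit 10 and CP11 at bit 11) and the choice to program CP11 identically to CP10 are assumptions taken from the architecture's register layout, so check them against the Armv8-M Architecture Reference Manual.

```c
#include <stdint.h>

/* Assumed field positions; verify against the architecture reference manual. */
#define CPACR_CP10_SHIFT 20u          /* CP10 field, bits [21:20] */
#define CPACR_CP11_SHIFT 22u          /* CP11 field, bits [23:22] */
#define NSACR_CP10_BIT   (1u << 10)   /* Non-secure access to CP10 */
#define NSACR_CP11_BIT   (1u << 11)   /* Non-secure access to CP11 */

/* Return a CPACR value with full access (0b11) to the FP Extension and MVE. */
uint32_t cpacr_with_full_access(uint32_t cpacr) {
  return cpacr | (0x3u << CPACR_CP10_SHIFT) | (0x3u << CPACR_CP11_SHIFT);
}

/* Return an NSACR value that permits Non-secure access to the FP Extension and MVE. */
uint32_t nsacr_with_nonsecure_access(uint32_t nsacr) {
  return nsacr | NSACR_CP10_BIT | NSACR_CP11_BIT;
}
```

On a real target, privileged startup code would typically write the result back to the corresponding register and then execute barrier instructions before the first Helium instruction. That sequence is device-specific and is not shown here.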

Helium-enhanced libraries

Libraries provide a suite of functions you can use in your own code. Helium-enhanced libraries provide implementations of those functions that use Helium instructions.

When you compile for a Helium-enabled target, a library variant using Helium instructions is selected. When you compile for a target that does not support Helium, a library variant using standard Arm instructions is selected. This means that you can easily compile the same source code for both Helium and non-Helium enabled targets.

Examples of Helium-enabled libraries include:

  • CMSIS-DSP – A suite of common signal processing functions for use on Cortex-M processor-based devices
  • CMSIS-NN – A collection of efficient neural network kernels that are developed to maximize the performance, and minimize the memory footprint, of neural networks on Cortex-M processor cores

In this section of the guide, we examine the CMSIS-DSP library to see how to use libraries to write Helium-enabled code.

Getting the CMSIS-DSP library

The CMSIS-DSP library provides functions that are specifically designed for signal processing. The library provides over sixty common signal processing and mathematical functions for various data types.

CMSIS is the Arm Cortex Microcontroller Software Interface Standard. CMSIS provides a vendor-independent hardware abstraction layer for microcontrollers that are based on Arm Cortex processors. CMSIS defines generic tool interfaces and enables consistent device support. Its software interfaces simplify software reuse, reduce the learning curve for microcontroller developers, and improve time to market for new devices.

CMSIS is integrated into IDEs like Keil MDK and Arm Development Studio, but can also be used with a standalone compiler.

To use CMSIS:

Writing code using the CMSIS-DSP library

The CMSIS-DSP pack includes various examples. For example, the variance example demonstrates the use of basic math functions to calculate the variance of an input sequence.

Let's examine the source code of this example to look at some key features:

#include "arm_math.h"

The preceding code shows the CMSIS-DSP header file that declares the basic math functions used by the example. This code must be included to use the CMSIS-DSP functions.

The following code shows how the variance example uses CMSIS-DSP library functions to implement an algorithm that calculates the statistical variance of an input stream. The CMSIS-DSP functions are arm_fill_f32(), arm_dot_prod_f32(), and arm_mult_f32().

.
.
.
/* Calculation of mean value of input */
/* x' = 1/blockSize * (x(0)* 1 + x(1) * 1 + ... + x(n-1) * 1) */
/* Fill wire1 buffer with 1.0 value */
arm_fill_f32(1.0, wire1, blockSize);

/* Calculate the dot product of testInput_f32 and wire1 */
/* (x(0)* 1 + x(1) * 1 + ...+ x(n-1) * 1) */
arm_dot_prod_f32(testInput_f32, wire1, blockSize, &mean);

/* Calculation of 1/blockSize */
oneByBlockSize = 1.0 / (blockSize);

/* 1/blockSize * (x(0)* 1 + x(1) * 1 + ... + x(n-1) * 1)  */
arm_mult_f32(&mean, &oneByBlockSize, &mean, 1);

.
.
.

The CMSIS-DSP functions used in this example are:

All available CMSIS-DSP functions are described in the CMSIS DSP Software Library Reference.
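
To make clear what the library calls in the excerpt compute, here is a plain C sketch of the same mean calculation, followed by a variance calculation. This is a scalar reference, not the CMSIS-DSP implementation. The ref_ function names are hypothetical, and dividing by blockSize - 1 (the unbiased sample variance) is an assumed convention.

```c
#include <stdint.h>

/* Scalar equivalent of filling a buffer with 1.0, taking the dot product
   with the input, and scaling the result by 1/blockSize. */
float ref_mean_f32(const float *src, uint32_t blockSize) {
  if (blockSize == 0u) {
    return 0.0f;  /* Avoid dividing by zero for an empty input */
  }
  float sum = 0.0f;
  for (uint32_t i = 0; i < blockSize; i++) {
    sum += src[i] * 1.0f;
  }
  return sum / (float)blockSize;
}

/* Sample variance: sum of squared deviations from the mean,
   divided by (blockSize - 1). Returns 0 for fewer than two samples. */
float ref_variance_f32(const float *src, uint32_t blockSize) {
  if (blockSize < 2u) {
    return 0.0f;
  }
  float mean = ref_mean_f32(src, blockSize);
  float sumSq = 0.0f;
  for (uint32_t i = 0; i < blockSize; i++) {
    float d = src[i] - mean;
    sumSq += d * d;
  }
  return sumSq / (float)(blockSize - 1u);
}
```

A reference like this can be useful for checking the output of the library-based version on a small input before moving to the target.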

Compiling CMSIS-DSP code for Helium

When compiling for a Helium-enabled target, the compiler will automatically select the CMSIS-DSP variant that uses Helium instructions.

For example, in Arm Development Studio, select Generic Armv8.1-M Main (MVE Integer) to target any Helium-enabled Armv8-M platform.

When compiling with a standalone compiler, you must ensure that the CMSIS header files are on the include path, and the CMSIS libraries are on the library path.

For example, to target the Armv8.1-M architecture with Helium, use this command:

armclang --target=arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp \
            -I <cmsis_dir>/DSP/Include/ -L <cmsis_dir>/DSP/Lib/ ...

To target the Cortex-M55, use this command:

armclang --target=arm-arm-none-eabi -mcpu=cortex-m55 -I <cmsis_dir>/DSP/Include/ \
            -L <cmsis_dir>/DSP/Lib/ ...

Auto-vectorization and Helium

There are many different ways to write code that takes advantage of Helium technology. Writing hand-optimized assembly kernels, or C code containing Helium intrinsics, provides a high level of control over the Helium code in your software. However, these methods can carry significant costs in reduced portability and increased engineering complexity.

Often a high-quality compiler can generate code which is just as good, but requires significantly less design time. Auto-vectorization is the process of allowing the compiler to automatically identify opportunities in your code to use Helium instructions.

Auto-vectorization includes the following compilation techniques:

  • Loop vectorization – Unrolling loops to reduce the number of iterations, while performing more operations in each iteration.
  • Superword-Level Parallelism (SLP) vectorization – Bundling scalar operations together to use full width Helium instructions.
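
As an illustration of the two techniques, the following sketch shows one function of each shape. The function names are hypothetical, and whether a given compiler actually vectorizes either function depends on the target and the optimization options, so treat these as candidates rather than guarantees.

```c
#include <stdint.h>

/* Candidate for loop vectorization: the same operation is applied to
   every element, and iterations do not depend on each other. */
void add_arrays_i32(const int32_t *restrict a, const int32_t *restrict b,
                    int32_t *restrict dst, uint32_t n) {
  for (uint32_t i = 0; i < n; i++) {
    dst[i] = a[i] + b[i];
  }
}

/* Candidate for SLP vectorization: four independent scalar operations on
   adjacent memory locations that could be bundled into one vector operation. */
void add4_i32(const int32_t *restrict a, const int32_t *restrict b,
              int32_t *restrict dst) {
  dst[0] = a[0] + b[0];
  dst[1] = a[1] + b[1];
  dst[2] = a[2] + b[2];
  dst[3] = a[3] + b[3];
}
```

The restrict qualifiers are a design choice in this sketch. They tell the compiler that the arrays do not overlap, which removes one common obstacle to vectorization.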

Auto-vectorizing compilers for Cortex-M processors include Arm Compiler 6 and LLVM-clang.

The benefits of relying on compiler auto-vectorization include:

  • Programs implemented in high-level languages are portable, if there are no architecture-specific code elements like inline assembly or intrinsics.
  • Modern compilers can perform advanced optimizations automatically.
  • Targeting a given micro-architecture can be as easy as setting a single compiler option. Optimizing an assembly program requires deep knowledge of the target hardware.

However, auto-vectorization might not be the right choice in all situations:

  • While source code can be architecture agnostic, it may have to be compiler specific to get the best code generation.
  • Small changes in a high-level language or the compiler options can result in significant and unpredictable changes in generated code.

Using the compiler to generate Helium instructions is appropriate for most projects. Other methods for exploiting Helium are necessary only when the generated code does not deliver the necessary performance, or when particular hardware features are not supported by high-level languages.

Compiling for Helium with Arm Compiler 6

To enable automatic vectorization, you must specify appropriate compiler options.

These compiler options must do the following:

  • Target a processor that has Helium capabilities
  • Specify an optimization level that includes auto-vectorization

Specifying a Helium-capable target

If you want to run code on one processor, you can target that specific processor with the -mcpu option. Performance is optimized for the micro-architectural specifics of that processor. However, code is only guaranteed to run on that processor.

Alternatively, if you want your code to run on a range of processors, you can target an architecture with the -march option. Generated code runs on any processor implementation of that target architecture, but performance might be impacted.

In both cases, you can use one of the following feature modifiers to enable Helium:

  • +mve enables MVE instructions for integer operations.
  • +mve.fp enables MVE instructions for integer and single-precision floating-point operations.
  • +mve.fp+fp.dp enables MVE instructions for integer, single-precision, and double-precision floating-point operations.

The Helium extension is always enabled on the Cortex-M55, so there is no need to use a feature modifier. Targeting the processor is sufficient to generate Helium code, as in the following command:

armclang --target=arm-arm-none-eabi -mcpu=cortex-m55 ...

To target Helium for any Helium-enabled Armv8-M platform, you must specify a feature modifier, as in the following command:

armclang --target=arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp ...

Specifying an auto-vectorizing optimization level

Arm Compiler 6 provides a wide range of optimization levels, selected with the -O option.

The following table defines the available optimization levels:

Option   Description                                                Auto-vectorization
-O0      Minimum optimization                                       Never
-O1      Restricted optimization                                    Disabled by default
-O2      High optimization                                          Enabled by default
-O3      Very high optimization                                     Enabled by default
-Os      Reduce code size, balancing code size against code speed   Enabled by default
-Oz      Smallest possible code size                                Enabled by default
-Ofast   Optimize for high performance beyond -O3                   Enabled by default
-Omax    Optimize for high performance beyond -Ofast                Enabled by default

See Selecting optimization options, in the Arm Compiler User Guide and -O, in the Arm Compiler armclang Reference Guide for more details about these options.

Auto-vectorization is enabled by default at optimization level -O2 and higher. The -fno-vectorize option lets you disable auto-vectorization.

At optimization level -O1, auto-vectorization is disabled by default. The -fvectorize option lets you enable auto-vectorization.

At optimization level -O0, auto-vectorization is always disabled. If you specify the -fvectorize option, the compiler ignores it.

To enable auto-vectorization, do one of the following:

  • Select an optimization level of -O2 or higher.
  • Select an optimization level of -O1 and specify -fvectorize.

Helium auto-vectorization example

The following Helium auto-vectorization example shows how an auto-vectorizing compiler identifies optimization opportunities in source code, and uses Helium instructions to maximize performance.

The example function clips floating-point values if they fall outside of a specified range. The function takes the following parameters:

  • *pSrc, a pointer to an array of input data
  • *pDst, a pointer to an array where output data will be stored
  • low, the lower bound of the clipping range. Input data values lower than low are replaced with low.
  • high, the upper bound of the clipping range. Input data values higher than high are replaced with high.
  • numSamples, the number of data values in the input array (and therefore also the output array once the function has finished).

The example function is implemented as follows:

#include "arm_math.h"

void arm_clip_f32(float32_t * pSrc, float32_t * pDst, float32_t low, float32_t high,
                      uint32_t numSamples) {
  for (uint32_t i = 0; i < numSamples; i++) {
    if (pSrc[i] > high)
      pDst[i] = high;
    else if (pSrc[i] < low)
      pDst[i] = low;
    else
      pDst[i] = pSrc[i];
  }
}

Compile this code with Arm Compiler 6 as follows:

armclang --target=arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp -Ofast \
             -S arm_clip_f32.c

In this example, using the -S option means that the compiler outputs an assembly listing of the compiled code to the file arm_clip_f32.s.

Examining the arm_clip_f32.s file shows the instructions the compiler has generated:

	  ...
        dls          lr, lr
.LBB0_13:                               @ =>This Inner Loop Header: Depth=1
        vldrw.u32    q3, [r3], #16
        vpt.f32      ge, q3, q2
        vcmpt.f32    ge, q1, q3
        vstr         p0, [sp]           @ 4-byte Spill
        vcmp.f32     gt, q2, q3
        vstr         p0, [sp, #4]       @ 4-byte Spill
        vldr         p0, [sp]           @ 4-byte Reload
        vpst
        vstrwt.32    q3, [r2]
        vldr         p0, [sp, #4]       @ 4-byte Reload
        vpstt
        vcmpt.f32    ge, q1, q3
        vstrwt.32    q2, [r2]
        vpt.f32      gt, q3, q1
        vstrwt.32    q1, [r2], #16
        le           lr, .LBB0_13
	  ...

The Helium instruction VLDRW.U32 loads our data into vector lanes. In this example the data values are 32-bit, so each vector is loaded with four data values at a time.

The VCMP.F32 Helium instructions then compare those vector lanes concurrently against the upper and lower clipping values.

Helium predication instructions such as VPST selectively perform the clipping operation only on data where the comparison reveals that clipping is needed.
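
The same clipping operation can also be written so that each iteration performs exactly one load and one store, with the clipping expressed as two selections. The following is a sketch of that alternative formulation, not the CMSIS-DSP implementation. The function name is hypothetical, and the code assumes that low is less than or equal to high and that the input and output arrays do not overlap. Whether it produces better Helium code than the original is something to confirm by compiling with -S and inspecting the output.

```c
#include <stdint.h>

/* Clip each input value to the range [low, high].
   Assumes low <= high and non-overlapping arrays. */
void clip_f32_select(const float *restrict pSrc, float *restrict pDst,
                     float low, float high, uint32_t numSamples) {
  for (uint32_t i = 0; i < numSamples; i++) {
    float x = pSrc[i];
    x = (x > high) ? high : x;  /* Apply the upper bound */
    x = (x < low) ? low : x;    /* Apply the lower bound */
    pDst[i] = x;                /* Single unconditional store */
  }
}
```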

Coding best practice for auto-vectorization

As an implementation becomes more complicated, the likelihood that the compiler can auto-vectorize the code decreases.

For example, loops with the following characteristics are particularly difficult, or impossible, to vectorize:

  • Loops with interdependencies between different loop iterations
  • Loops with break clauses
  • Loops with complex conditions

Arm recommends modifying your source code implementation to eliminate these situations where possible.

For example, a necessary condition for auto-vectorization is that the number of loop iterations must be known at the start of the loop. Break conditions mean that the iteration count might not be knowable at the start of the loop, which prevents auto-vectorization. If it is not possible to completely avoid a break condition, it might be worthwhile to split the loop into separate vectorizable and non-vectorizable parts.
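
The following sketch illustrates one way of splitting a loop that contains a break. The scenario is hypothetical: samples are scaled by a gain until the first negative value is found. Instead of testing for the negative value inside the arithmetic loop, a first loop finds how many samples to process, and a second loop with a known iteration count does the arithmetic. The function name and the assumption that the arrays do not overlap are for illustration only.

```c
#include <stdint.h>

/* Scale samples by gain, stopping at the first negative sample.
   Returns the number of samples that were processed. */
uint32_t scale_until_negative(const float *restrict src, float *restrict dst,
                              float gain, uint32_t n) {
  uint32_t count = 0;

  /* Part 1: search loop with an early exit. This part is unlikely to vectorize. */
  while (count < n && !(src[count] < 0.0f)) {
    count++;
  }

  /* Part 2: counted loop with no break. The iteration count is known
     before the loop starts, so this part is a vectorization candidate. */
  for (uint32_t i = 0; i < count; i++) {
    dst[i] = src[i] * gain;
  }

  return count;
}
```

This approach reads the leading part of the input twice, so it is a tradeoff. It is most likely to pay off when the arithmetic in the second loop is expensive compared with the search.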

A full discussion of the compiler directives that are used to control vectorization of loops can be found in the LLVM-Clang documentation, but the two most important are:

  • #pragma clang loop vectorize(enable)
  • #pragma clang loop interleave(enable)

These pragmas are hints to the compiler to perform loop vectorization and loop interleaving respectively. They are [COMMUNITY] features of Arm Compiler.
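
As a minimal sketch, the pragmas are placed immediately before the loop that they apply to. The function below is a hypothetical example. The pragma is wrapped in a check for the __clang__ macro, on the assumption that the code might also be built with a compiler that does not recognize it.

```c
#include <stdint.h>

/* Compute y[i] = a * x[i] + y[i] for n elements.
   Assumes that x and y do not overlap. */
void saxpy_f32(float a, const float *restrict x, float *restrict y, uint32_t n) {
#if defined(__clang__)
#pragma clang loop vectorize(enable) interleave(enable)
#endif
  for (uint32_t i = 0; i < n; i++) {
    y[i] = a * x[i] + y[i];
  }
}
```

Because the pragmas are hints, the compiler can still decline to vectorize the loop. Inspecting the generated assembly is the way to confirm the result.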

More detailed guides covering auto-vectorization are available for the Arm C/C++ Compiler Linux user-space compiler, although many of the points apply across LLVM-Clang variants:

Helium intrinsics

Intrinsics are functions whose precise implementation is known to a compiler. The Helium intrinsics are a set of C and C++ functions that are defined in the header file arm_mve.h. The Arm compilers and GCC support these intrinsics.

Helium intrinsics provide direct access to Helium instructions from C and C++ code without having to write assembly code by hand. The intrinsics map to short assembly kernels which are inlined into the calling code. Also, the compiler handles register allocation and pipeline optimization. This means that many difficulties that are faced by the assembly programmer are avoided.

See the Arm MVE Intrinsics Reference Architecture specification (also available as interactive HTML) for a list of all the Helium intrinsics. This specification forms part of the Arm C Language Extensions (ACLE).

Using the Helium intrinsics has several benefits:

  • Powerful: Intrinsics give the programmer direct access to the Helium instruction set without the need for hand-written assembly code.
  • Portable: Hand-written Helium assembly instructions might need to be rewritten for different target processors. C and C++ code containing Helium intrinsics can be compiled for a new target with minimal or no code changes.
  • Flexible: The programmer can exploit Helium when needed, or use C/C++ otherwise, while avoiding many low-level engineering concerns.

However, intrinsics might not be the right choice in all situations:

  • It is more difficult to use Helium intrinsics than to import a library or rely on a compiler.
  • Hand-optimized assembly code might offer the greatest scope for performance improvement even if it is more difficult to write.

Helium header file

You should test the __ARM_FEATURE_MVE macro before including the header. The __ARM_FEATURE_MVE macro is a 2-bit bitmap indicating M-profile Vector Extension (MVE) support:

  • Bit 0 indicates whether Helium integer instructions are available.
  • Bit 1 indicates whether Helium floating-point instructions are available.

The valid values of __ARM_FEATURE_MVE are therefore:

  • 0 indicates that Helium is not available.
  • 1 indicates that only the Helium integer intrinsics are available.
  • 3 indicates that both the Helium integer and floating-point intrinsics are available.

The __ARM_FEATURE_MVE macro should be tested to check that Helium is enabled on the target platform before including the header:

#if (__ARM_FEATURE_MVE & 3) == 3
#include <arm_mve.h>
     /* MVE integer and floating-point intrinsics are now available to use. */
#elif __ARM_FEATURE_MVE & 1
#include <arm_mve.h>
     /* MVE integer intrinsics are now available to use. */
#endif
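
The same macro test can also select between a Helium code path and a standard C fallback, so that one source file builds for both kinds of target. The following is a sketch under stated assumptions: the USE_HELIUM_FP macro and the mult_f32() function are hypothetical names, and the function takes an element count rather than a size in bytes.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_MVE) && ((__ARM_FEATURE_MVE & 3) == 3)
#include <arm_mve.h>
#define USE_HELIUM_FP 1  /* Hypothetical project macro */
#else
#define USE_HELIUM_FP 0
#endif

/* Multiply two float arrays element by element. */
void mult_f32(const float *pSrcA, const float *pSrcB, float *pDst,
              uint32_t numSamples) {
  uint32_t i = 0;

#if USE_HELIUM_FP
  /* Helium path: process four 32-bit lanes per iteration. */
  for (; i + 4u <= numSamples; i += 4u) {
    float32x4_t vecA = vldrwq_f32(&pSrcA[i]);
    float32x4_t vecB = vldrwq_f32(&pSrcB[i]);
    vstrwq_f32(&pDst[i], vmulq_f32(vecA, vecB));
  }
#endif

  /* Scalar path: handles every element on non-Helium targets, and any
     remaining elements after the vector loop on Helium targets. */
  for (; i < numSamples; i++) {
    pDst[i] = pSrcA[i] * pSrcB[i];
  }
}
```

Sharing the index variable between the two loops keeps the fallback and the remainder handling in one place.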

Namespaces

By default, Helium intrinsics occupy both the user namespace and the __arm_ namespace.

That is, both these lines of code are equivalent:

vecDst = vmulq_f32(vecA, vecB);
vecDst = __arm_vmulq_f32(vecA, vecB);

Defining the macro __ARM_MVE_PRESERVE_USER_NAMESPACE before including arm_mve.h hides the definition of the user namespace variants:

#define __ARM_MVE_PRESERVE_USER_NAMESPACE
#include <arm_mve.h>
vecDst = vmulq_f32(vecA, vecB);           // Invalid. User namespace variants are hidden.
vecDst = __arm_vmulq_f32(vecA, vecB);     // Valid.

Compiling code containing Helium intrinsics with Arm Compiler 6

To compile code containing Helium intrinsics, you must do the following:

The preceding steps are the minimum that you must do to enable Helium intrinsics to be compiled into Helium instructions. However, you might also want to have the compiler perform auto-vectorization. This will allow you to identify further opportunities in your code to improve performance with Helium. In this case, specify an appropriate optimization level to enable auto-vectorization.

To target Helium for any Helium-enabled Armv8-M platform:

armclang --target=arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp ...

Helium intrinsics example

This example shows how you can use Helium intrinsics to perform vector multiplication.

Vector multiplication multiplies the value of the elements in the first source vector by the respective elements in the second source vector, then writes the result to a destination vector register. That is:

[Ai, Aj, Ak, ...] x [Bi, Bj, Bk, ...] = [(Ai x Bi), (Aj x Bj), (Ak x Bk), ...]

The main() function does the following:

  • Creates two source data arrays, each containing eight floating-point numbers
  • Calls the my_mult_f32_intr() function, passing the following arguments:
    • Pointers to the two input arrays to use as the data source
    • The block size in memory of the input arrays
    • A pointer to the A_src array, to use as the result destination. The result will therefore overwrite the original data.

The my_mult_f32_intr() function does the following:

  • Uses the block size to calculate how many vector loop iterations are required. Because we are dealing with 32-bit floating-point values, and the Helium registers are 128 bits wide, we can operate on four data values in each iteration.
  • Loads data from the input arrays into the Helium vector registers, four values at a time
  • Performs the vector multiplication on the input vectors
  • Stores the result vector into the destination array
  • Advances the array pointers by the size of four data elements
  • Decrements the loop counter, and loops around until all loop iterations have finished

The my_mult_f32_intr() function is implemented as follows:

#include <stdio.h>
#include <arm_mve.h> 

void my_mult_f32_intr(
                float32_t * pSrcA, float32_t * pSrcB,
                float32_t * pDst, uint32_t blockSize) {

  // Number of float32_t lanes in one 128-bit Helium vector
  const uint32_t lanes = 4;

  // Calculate memory block size in bytes for 4 x lanes of float32_t data
  const uint32_t blkSize_F32 = lanes * sizeof(float32_t);

  // Calculate how many loop iterations are required:
  //    size of array in bytes / size in bytes of 4 data items
  // Any remainder that does not fill a whole vector is not processed.
  uint32_t blkCnt = blockSize / blkSize_F32;

  // Create source and destination vectors, configured for 4 lanes of float32_t data
  float32x4_t vecA, vecB, vecDst;


  // Main loop
  while (blkCnt > 0U) {
    // Load source vectors with data from the input arrays
    vecA = vldrwq_f32(pSrcA);
    vecB = vldrwq_f32(pSrcB);

    // Perform vector multiplication
    vecDst = vmulq_f32(vecA, vecB);

    // Store the result vector into the destination array
    vstrwq_f32(pDst,  vecDst);

    // Decrement the loop count
    blkCnt--;

    // Advance source and destination pointers by 4 data elements.
    // Pointer arithmetic is in units of float32_t, not bytes.
    pSrcA += lanes;
    pSrcB += lanes;
    pDst += lanes;
  }
}

int main() {
  // Setup data in input arrays
  float32_t A_src[] = {1.1, 7.9, 8.2, 2.1, 5.3, 2.2, 3.1, 6.9};
  float32_t B_src[] = {7.2, 2.7, 9.9, 8.2, 1.3, 1.1, 6.9, 2.4};

  // Call the multiplication function
  my_mult_f32_intr(&A_src[0], &B_src[0], &A_src[0], sizeof(A_src));

  return 0;
}

The following table shows some additional information about the intrinsics that are used:

Intrinsic    Description
vldrwq_f32   Loads consecutive elements from memory into a destination vector register.
vmulq_f32    Multiplies the value of the elements in the first source vector register by the respective elements in the second source vector register. The result is then written to the destination vector register.
vstrwq_f32   Stores consecutive elements to memory from a vector register.

You can compile this code with Arm Compiler 6 as follows:

armclang --target=arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp \
            -Ofast -S my_mult_f32_intr.c

In this example, using the -S option means that the compiler outputs an assembly listing of the compiled code to the file my_mult_f32_intr.s.

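Examining the my_mult_f32_intr.s file shows the instructions that the compiler has generated. The exact listing, in particular the pointer updates and addressing modes, depends on the source and the compiler version: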

	  ...
.LBB0_1:                                @ =>This Inner Loop Header: Depth=1
        vldrw.u32       q0, [r1]
        vldrw.u32       q1, [r0]
        adds            r1, #64
        adds            r0, #64
        vmul.f32        q0, q1, q0
        vstrw.32        q0, [r2]
        adds            r2, #64
        le              lr, .LBB0_1
	  ...

Here we can see that:

  • The vldrwq_f32 intrinsics compile to vldrw.u32 instructions.
  • The vmulq_f32 intrinsic compiles to a vmul.f32 instruction.
  • The vstrwq_f32 intrinsic compiles to a vstrw.32 instruction.
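
The example function processes whole vectors only, so the number of elements must be a multiple of four. One way to handle other lengths is tail predication, where a predicate disables the unused lanes in the final iteration. The following is a sketch of that technique using the vctp32q, vldrwq_z_f32, vmulq_x_f32, and vstrwq_p_f32 intrinsics. The function name and the HAVE_HELIUM_FP macro are hypothetical, the function takes an element count rather than a size in bytes, and a standard C fallback is included so that the file also builds for targets without Helium.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_MVE) && ((__ARM_FEATURE_MVE & 3) == 3)
#include <arm_mve.h>
#define HAVE_HELIUM_FP 1  /* Hypothetical project macro */
#else
#define HAVE_HELIUM_FP 0
#endif

/* Multiply two float arrays element by element, for any element count. */
void mult_f32_tail(const float *pSrcA, const float *pSrcB, float *pDst,
                   uint32_t numSamples) {
#if HAVE_HELIUM_FP
  int32_t remaining = (int32_t)numSamples;

  while (remaining > 0) {
    /* Predicate with the first min(remaining, 4) lanes active */
    mve_pred16_t p = vctp32q((uint32_t)remaining);

    /* Predicated loads: inactive lanes are zeroed */
    float32x4_t vecA = vldrwq_z_f32(pSrcA, p);
    float32x4_t vecB = vldrwq_z_f32(pSrcB, p);

    /* Predicated multiply and store: only active lanes are written */
    vstrwq_p_f32(pDst, vmulq_x_f32(vecA, vecB, p), p);

    /* Advance by four elements */
    pSrcA += 4;
    pSrcB += 4;
    pDst += 4;
    remaining -= 4;
  }
#else
  /* Fallback for targets without Helium floating-point support */
  for (uint32_t i = 0; i < numSamples; i++) {
    pDst[i] = pSrcA[i] * pSrcB[i];
  }
#endif
}
```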

Mixing C/C++ and Helium assembly code

Arm Compiler 6 provides an inline assembler that enables you to incorporate assembly code directly in your C or C++ source code. The benefit of using inline assembly rather than writing pure assembly code is that you can read and write C variables directly.

Introduction to inline assembly

The __asm keyword lets you incorporate inline assembly code, written in GCC syntax, into a function.

Here is a simple example that uses a single inline ADD assembly instruction:

#include <stdio.h>

int add(int i, int j)
{
  int res = 0;
  __asm (
    "ADD %[result], %[input_i], %[input_j]"
    : [result] "=r" (res)
    : [input_i] "r" (i), [input_j] "r" (j)
  );
  return res;
}

int main(void)
{
  int a = 1;
  int b = 2;
  int c = 0;

  c = add(a,b);

  printf("Result of %d + %d = %d\n", a, b, c);
}

The general form of the __asm inline assembly statement is:

__asm(
 	code
 	[: output_operand_list 
 	[: input_operand_list 
 	[: clobbered_register_list]]]
);

Where:

  • code is the assembly code.
    In our example, this is:
    "ADD %[result], %[input_i], %[input_j]"
    This single instruction adds two inputs, represented by the symbolic names input_i and input_j, and assigns the sum to the symbolic name result.
  • output_operand_list maps output symbolic names to C variable names.
    In our example, there is just one output:
    : [result]"=r" (res)
    This indicates that the symbolic name result maps to the C variable res. The "=r" constraint indicates that the value resides in a register, and that any previous value in that register is overwritten.
  • input_operand_list maps input symbolic names to C variable names.
    In our example, there are two inputs:
    : [input_i] "r" (i), [input_j] "r" (j)
    This indicates that the symbolic name input_i maps to the C variable i, and input_j maps to j.
  • clobbered_register_list is a comma-separated list of register names. These are registers that the assembly code potentially modifies, but for which the final value is not important. Including a register in the clobber list prevents the compiler from using that register for other purposes in the code.
    In our example, there are no clobbered registers, because only the previously declared input and result registers are used.

Constraints and modifiers let you tell the compiler how values will be used in the assembly code. In our example, we used the "=r" constraint to identify that the output register is overwritten. There are many other constraints and modifiers that let you express more complex requirements. For example, you can use the %Q and %R template modifiers to access the lower and upper halves of a 64-bit value that is held in a register pair.
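
As a minimal sketch of a different constraint and a non-empty clobber list, the following function uses the "+r" constraint to mark an operand as both read and written, and lists "cc" as a clobber because the ADDS instruction updates the condition flags. The function name accumulate and the plain C fallback branch are assumptions made for illustration, so that the same source also builds on a non-Arm host:

```c
// Hypothetical helper, not part of the original examples.
// Adds x to acc and returns the new total.
int accumulate(int acc, int x)
{
#if defined(__arm__)
  __asm (
    "ADDS %[acc], %[acc], %[x]"   // acc = acc + x, updating the flags
    : [acc] "+r" (acc)            // "+r": register operand that is read and written
    : [x] "r" (x)                 // "r": read-only register operand
    : "cc"                        // ADDS changes the condition flags
  );
#else
  acc += x;                       // Plain C fallback for non-Arm hosts
#endif
  return acc;
}
```

Calling accumulate(1, 2) returns 3 on either path. Because acc is a read-write operand, the compiler knows that it must load the current value into the register before the instruction runs.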

The Arm Compiler Reference Guide: armclang Inline Assembler provides more information about inline assembly, constraints, and modifiers.

For more information about GCC and the format used by the __asm keyword, see the GNU documentation.

Complex vector dot product example

Helium is ideally suited for the types of operations that are commonly performed by DSP and ML applications.

This example shows how you can use inline assembly and Helium instructions to calculate the vector dot product of two arrays of complex numbers.

Complex numbers have two parts: real and imaginary, expressed as:

a + bi

Each complex number is therefore represented as a pair of numbers, a and b.

The example uses the Q31 fixed-point number format to represent the data. In the Q31 format, 1 bit represents the sign (0 for positive, 1 for negative) and the remaining 31 bits represent the fractional data. Values are stored in two's complement form, so the Q31 format can express numbers in the range -1 to almost 1.

For example:

Q31 value (decimal)    Q31 value (hex)    Floating-point value
0                      0x0000 0000        0
1                      0x0000 0001        0.0000000004656613
984267707              0x3AAA BBBB        0.4583353674970567
2147483647             0x7FFF FFFF        0.9999999995343387
-2147483648            0x8000 0000        -1
-1288490189            0xB333 3333        -0.6000000000931323
-1                     0xFFFF FFFF        -0.0000000004656613
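
The values in the table above follow from dividing the 32-bit integer by 2^31. The following sketch shows that conversion in both directions. The helper names q31_to_double and double_to_q31 are assumptions made for illustration and are not part of any library referenced in this guide:

```c
#include <stdint.h>

// Hypothetical helpers, named here for illustration only.

// Convert a Q31 value to floating point by dividing by 2^31.
double q31_to_double(int32_t q)
{
  return (double)q / 2147483648.0;
}

// Convert a floating-point value to Q31, saturating at the ends of the range.
int32_t double_to_q31(double x)
{
  if (x >= 1.0)  return INT32_MAX;     // 1.0 itself is not representable
  if (x <= -1.0) return INT32_MIN;     // -1.0 is the most negative value
  return (int32_t)(x * 2147483648.0);  // Truncates toward zero
}
```

For example, q31_to_double(984267707) gives approximately 0.4583353675, matching the third row of the table, and double_to_q31(0.5) gives 1073741824 (0x4000 0000).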

For two complex numbers:

a + bi
c + di

The vector dot product is calculated as:

((a x c) – (b x d)) + ((a x d) + (b x c))i
~~~~~~~~~~~~~~~~~~~   ~~~~~~~~~~~~~~~~~~~
         ^                     ^
         |                     |
     Real part           Imaginary part

To calculate the dot product for vectors of complex numbers, the individual dot products for each input pair are summed.

That is, the underlying algorithm is:

realResult = 0;
imagResult = 0;
for (n = 0; n < numSamples; n++) {
    realResult += pSrcA[(2*n)+0] * pSrcB[(2*n)+0] - pSrcA[(2*n)+1] * pSrcB[(2*n)+1];
    imagResult += pSrcA[(2*n)+0] * pSrcB[(2*n)+1] + pSrcA[(2*n)+1] * pSrcB[(2*n)+0];
}

Where:

  • realResult is the real component of the summed dot products.
  • imagResult is the imaginary component of the summed dot products.
  • pSrcA and pSrcB are pointers to the two input arrays. Each input array contains complex numbers with the real and imaginary parts interleaved. That is, the array {1, 2, 3, 4, 5, 6} represents the three complex numbers 1 + 2i, 3 + 4i, and 5 + 6i.
  • numSamples is the number of complex number pairs in each input array.
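
As a point of reference, the algorithm above can be sketched in portable C. This sketch assumes that the results are accumulated in the Q16.48 format, which it reaches by shifting each 64-bit product right by 14 bits. The function names ref_cmplx_dot_prod_q31 and q16_48_to_double are assumptions made for illustration, and the two typedefs stand in for the CMSIS-DSP types of the same names. A Helium implementation can round the products differently, so the low-order bits of its results may not match this sketch exactly:

```c
#include <stdint.h>

typedef int32_t q31_t;   // Stand-ins for the CMSIS-DSP fixed-point types
typedef int64_t q63_t;

// Hypothetical scalar reference implementation (name is an assumption).
void ref_cmplx_dot_prod_q31(const q31_t *pSrcA, const q31_t *pSrcB,
                            uint32_t numSamples,
                            q63_t *realResult, q63_t *imagResult)
{
  q63_t realSum = 0;
  q63_t imagSum = 0;

  for (uint32_t n = 0; n < numSamples; n++) {
    q63_t a = pSrcA[2 * n];       // Real part of A[n]
    q63_t b = pSrcA[2 * n + 1];   // Imaginary part of A[n]
    q63_t c = pSrcB[2 * n];       // Real part of B[n]
    q63_t d = pSrcB[2 * n + 1];   // Imaginary part of B[n]

    // Each product has 62 fractional bits; shifting right by 14 leaves 48.
    // Right-shifting a negative value is implementation-defined in C,
    // but is an arithmetic shift on the usual compilers.
    realSum += (a * c) >> 14;
    realSum -= (b * d) >> 14;
    imagSum += (a * d) >> 14;
    imagSum += (b * c) >> 14;
  }

  *realResult = realSum;
  *imagResult = imagSum;
}

// Convert a Q16.48 result to floating point, for example for printing.
double q16_48_to_double(q63_t x)
{
  return (double)x / 281474976710656.0;   // Divide by 2^48
}
```

For the single pair 0.5 + 0.25i and 0.5 + 0.5i, the real result converts to 0.125 and the imaginary result converts to 0.375, as the formula predicts.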

We can implement this algorithm with Helium instructions in inline assembly code as follows:

void my_cmplx_dot_prod_q31(
  q31_t * pSrcA,
  q31_t * pSrcB,
  uint32_t numSamples,
  q63_t * realResult,
  q63_t * imagResult)
{
  __asm volatile (
        "   clrm                    {r4-r7}                 \n"
        "   wlstp.32                lr, %[cnt], 1f          \n"
        "2:                                                 \n"
        "   vldrw.32                q0, [%[pA]], 16         \n"
        "   vldrw.32                q1, [%[pB]], 16         \n"
        "   vrmlsldavha.s32         r4, r5, q0, q1          \n"
        "   vrmlaldavhax.s32        r6, r7, q0, q1          \n"
        "   letp                    lr, 2b                  \n"
        "1:                                                 \n"
        "   asrl                    r4, r5, #6              \n"
        "   asrl                    r6, r7, #6              \n"
        "   strd                    r4, r5, [%[realResult]] \n"
        "   strd                    r6, r7, [%[imagResult]] \n"
        :[pA] "+r"(pSrcA),[pB] "+r"(pSrcB)
        :[cnt] "r"(numSamples * 2), [realResult] "r"(realResult),
         [imagResult] "r"(imagResult)
        :"r4", "r5", "r6", "r7", "lr", "q0", "q1", "memory");
}

Key features of this code include:

  • The WLSTP (While Loop Start with Tail Predication) and LETP (Loop End with Tail Predication) instructions form the main loop. WLSTP loads the loop counter in LR with the number of 32-bit elements to process, which is numSamples * 2 because each complex number occupies two elements. On each iteration, LETP decrements the counter by four, the number of 32-bit elements in a vector, and the loop continues until no elements remain.
  • The VLDRW (Vector Load Register) instruction loads the next two complex numbers from each array into Helium registers Q0 and Q1, and post-increments each pointer by 16 bytes.
  • The real component of the dot product is calculated by the VRMLSLDAVHA (Vector Rounding Multiply Subtract Long Dual Accumulate Across Vector Returning High 64 bits) instruction. This instruction multiplies corresponding elements from the vectors in the registers Q0 and Q1. Within each pair of multiply operations, one result is subtracted from the other. The scalar result is then added to the running total that is held in the two registers R5 (high 32 bits) and R4 (low 32 bits).
    This implements the following part of the algorithm:
    realResult += pSrcA[(2*n)+0] * pSrcB[(2*n)+0] - pSrcA[(2*n)+1] * pSrcB[(2*n)+1];
    The following diagram illustrates this calculation:
  • The imaginary component of the dot product is calculated by the VRMLALDAVHAX (Vector Rounding Multiply Add Long Dual Accumulate Across Vector Returning High 64 bits With Exchange) instruction. This instruction first swaps the values in each pair of values read from one source register, before multiplying them with the values from the other source register. The results of the pairs of multiply operations are combined by adding them together. The scalar result is then added to the running total held in the two registers R7 (high 32 bits) and R6 (low 32 bits).
    This implements the following part of the algorithm:
    imagResult += pSrcA[(2*n)+0] * pSrcB[(2*n)+1] + pSrcA[(2*n)+1] * pSrcB[(2*n)+0];
    The following diagram illustrates this calculation:
  • The ASRL (Arithmetic Shift Right Long) instruction performs an arithmetic shift right by 6 bits, to convert the result into a Q16.48 number.
  • The STRD (Store Register Dual) instruction returns the real and imaginary components of the result by writing them to the memory locations that the realResult and imagResult pointers address.

The following complete code example shows how to interface between the C and assembly code:

#include <arm_mve.h>
#include <arm_math.h>
#include <stdio.h>

void my_cmplx_dot_prod_q31(
  q31_t * pSrcA,
  q31_t * pSrcB,
  uint32_t numSamples,
  q63_t * realResult,
  q63_t * imagResult)
{
  __asm volatile (
        "   clrm                    {r4-r7}                 \n"
        "   wlstp.32                lr, %[cnt], 1f          \n"
        "2:                                                 \n"
        "   vldrw.32                q0, [%[pA]], 16         \n"
        "   vldrw.32                q1, [%[pB]], 16         \n"
        "   vrmlsldavha.s32         r4, r5, q0, q1          \n"
        "   vrmlaldavhax.s32        r6, r7, q0, q1          \n"
        "   letp                    lr, 2b                  \n"
        "1:                                                 \n"
        "   asrl                    r4, r5, #6              \n"
        "   asrl                    r6, r7, #6              \n"
        "   strd                    r4, r5, [%[realResult]] \n"
        "   strd                    r6, r7, [%[imagResult]] \n"
        :[pA] "+r"(pSrcA),[pB] "+r"(pSrcB)
        :[cnt] "r"(numSamples * 2), [realResult] "r"(realResult),
         [imagResult] "r"(imagResult)
        :"r4", "r5", "r6", "r7", "lr", "q0", "q1", "memory");
}



int main() {
  // Setup data in input arrays
  q31_t A_src[] = {947483647, 834662098, 111222333, 555666777, 101202303, 
                      555000222, 432654876, 999888777};
  q31_t B_src[] = {147483647, 623333999, 623957233, 876543098, 337744884, 
                      112233445, 909808707, 543098765};

  // Create pointers to the two input arrays
  q31_t *pA_src = A_src;
  q31_t *pB_src = B_src;

  // Setup result variables
  q63_t res_real;
  q63_t res_imag;

  // Create pointers to the two result variables
  q63_t *pres_real = &res_real;
  q63_t *pres_imag = &res_imag;


  // Get the number of elements in the array
  int num_array_elements = sizeof(A_src) / sizeof(q31_t);

  // Divide by 2 to get the number of complex samples
  //  (each sample is a pair: a real and an imaginary component)
  int num_vector_elements = num_array_elements / 2;

  // Call the dot product function, which writes its results through the two result pointers
  my_cmplx_dot_prod_q31(pA_src, pB_src, num_vector_elements, pres_real, pres_imag);

  // Print the result
  printf("\n\nreal=%lld ; imag=%lld\n", (long long)res_real, (long long)res_imag);

 return 0;
}

You can compile this code with Arm Compiler 6 as follows:

armclang --target=arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp test.c

Helium assembly code

Most programmers should not need to write their own Helium assembly code. Relying on an auto-vectorizing compiler, using Helium-enabled libraries, or using Helium intrinsics should all be considered in preference to hand-crafting assembly code.

However, for marginal cases, hand-coded Helium assembly code can be an alternative approach for experienced programmers.

Introduction to mixed C and assembly code programming

To code a function in Helium assembly language, we use two source files:

  • A .s file containing the assembly code function
  • A .c file containing the C code that calls the assembly code function

We then call the assembly function from our C code.

So that the C code can call the assembly function, the following requirements must be met:

  • The calling C code must:
    • Declare the external assembly function using the extern keyword
  • The called assembly function must:
    • Declare itself as a global function, using the .globl and .type directives
    • Conform to the Procedure Call Standard for the Arm Architecture (AAPCS), for example using registers R0 through R3 for input parameters, and R0 for any return value

For example, the following assembly code in myadd.s implements a function myadd():

 
     .globl   myadd
     .p2align 2
     .type    myadd,%function

myadd:                     // Function "myadd" entry point.
     .fnstart
     add      r0, r0, r1   // Arguments in R0 and R1. Add and put the result in R0.
     bx       lr           // Return by branching to the address in the link register.
     .fnend

The following C code in myadd.c then calls the assembly function myadd():

#include <stdio.h>

extern int myadd(int a, int b);

int main()
{
        int a = 4;
        int b = 5;
        printf("Adding %d and %d results in %d\n", a, b, myadd(a, b));
        return (0);
}

This example code can be compiled with Arm Compiler 6 as follows:

armclang --target=arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp myadd.c myadd.s

Mixed C and assembly code Helium example

The following example uses Helium assembly code to implement a function my_maximum_u32(). This function computes the maximum value of an array of data.

First we will look at the C code that calls the function, then we will look at the assembly code that implements the function.

The C code in max.c does the following:

  • Declares the my_maximum_u32() function as external using the extern keyword.
    The declaration of the function shows that it takes the following arguments:
    • pSrc, a pointer to the input data array
    • blkCnt, the number of data elements in the array
    • pResult, a pointer to the integer result which will hold the maximum value on return
  • Creates an array of integer data, and a pointer to that array.
  • Creates an integer variable for the result, and a pointer to that variable.
  • Calculates the number of elements in the array by dividing the size of the array by the size of the integer data type.
  • Calls the my_maximum_u32() function.
  • Prints the result.

The example code is as follows:

#include <stdio.h>

extern int my_maximum_u32(unsigned int * pSrc,
                          unsigned int blkCnt,
                          unsigned int * pResult);

int main()
{
  // Setup data in input array.
  unsigned int A_src[] = {1, 1, 3, 4, 2, 1, 1, 3, 8, 6, 2, 6, 6, 1, 3, 2, 9, 1};
  unsigned int  *pA_src = A_src;

  // Setup result variable
  unsigned int result;
  unsigned int *pResult = &result;

  // Get the number of elements in the array
  unsigned int numElements = sizeof(A_src) / sizeof(unsigned int);

  // Call the maximum function
  my_maximum_u32(pA_src, numElements, pResult);

  // Print the result (%u because the result is unsigned)
  printf("Maximum = %u\n", result);

  return 0;
}

The assembly code in max.s does the following:

  • Defines the label my_maximum_u32 as a global function using the .globl and .type directives.
  • On entry to the function, pushes the link register (LR) to the stack to preserve it for later.
    We need to preserve the link register because the low-overhead loop that follows uses LR as its loop counter.
    The instruction also pushes the register R7 to the stack. We do not need to preserve R7, but the easiest way to preserve the required 8-byte stack alignment is to push and pop registers in pairs.
  • vmov initializes the Helium Q0 vector register to 4 x 32-bit lanes, and sets all 4 lanes to zero.
  • wlstp initializes a loop, setting LR to the total number of data elements as specified by the function argument in register R1.
  • vldrw loads the four lanes of vector register Q1 with the next four values pointed to by the data pointer address in R0. We use the post-increment variant of the instruction to increment the data pointer by 16 (4 x 32 bits = 16 bytes), so that on the next loop iteration we will read the next four data values.
  • vmax compares each of the four lanes in Q1 to the corresponding lane in Q0, and stores the maximum for each lane in Q0.
    On the first loop, because we initialized all lanes of Q0 to zero, any nonzero values in Q1 become the new maximum values.
    On subsequent loops, lanes in Q0 are only updated when new maximum values are discovered.
  • letp marks the end of the loop that is initialized by wlstp, exiting the loop when LR equals zero.
    On each iteration, letp decrements the loop counter in LR by the number of elements in the vector.
    For all iterations except the last, four elements are processed, so that the LR is decremented by four.
    On the final iteration, there may be fewer than four elements to process. In this case, tail predication is used to deal with the data residual when the number of elements to be processed is not an exact multiple of the number of elements in the vector.
  • After all data elements have been processed, vmaxv finds the maximum value across all lanes of vector register Q0, storing that maximum value in R0. Because vmaxv also includes the starting value of R0 in the comparison, R0 is first set to zero.
  • Finally, the maximum value is stored in the address that is specified by the parameter in R2. The function returns by popping the previously preserved LR value into the Program Counter PC.

The example assembly code is as follows:

			
     .globl     my_maximum_u32
     .p2align   2
     .type      my_maximum_u32,%function

my_maximum_u32:                    // Function "my_maximum_u32" entry point.
     .fnstart
     push       {r7, lr}           // Save LR
     vmov.i32   q0, #0x0           // Initialize Q0 
     wlstp.32   lr, r1, end        // Loop Start... 
loop:                                
     vldrw.u32  q1, [r0], #16      //    Load 4 values into vector and increment pointer
     vmax.u32   q0, q1, q0         //    Find maximum
     letp       lr, loop           // ...Loop End
end:
     movs       r0, #0             // Initialize return value to zero
     vmaxv.u32  r0, q0             // Get maximum across all lanes in result vector
     str        r0, [r2]           // Store at address of result pointer
     pop        {r7, pc}           // Return by setting PC to saved LR
     .fnend

This example code can be compiled with Arm Compiler 6 as follows:

armclang --target=arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp max.c max.s
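
To check the results of the assembly routine on a host machine, it can help to have a scalar model of the same steps. The following sketch models the four-lane running maximum, the tail-predicated final iteration, and the cross-lane reduction in portable C. The function name model_maximum_u32 and the LANES constant are assumptions made for illustration. The model only approximates the instruction behavior: for example, it advances the pointer by the number of active lanes, whereas the vldrw instruction always advances it by 16 bytes:

```c
#include <stdint.h>

#define LANES 4   // A 128-bit Helium vector holds four 32-bit lanes

// Hypothetical scalar model of my_maximum_u32 (name is an assumption).
uint32_t model_maximum_u32(const uint32_t *pSrc, uint32_t blkCnt)
{
  uint32_t q0[LANES] = {0, 0, 0, 0};    // vmov.i32 q0, #0x0
  uint32_t remaining = blkCnt;          // wlstp.32 lr, r1, end

  while (remaining > 0) {
    // Tail predication: only the lanes that still have data are active.
    uint32_t active = (remaining < LANES) ? remaining : LANES;

    for (uint32_t lane = 0; lane < active; lane++) {
      uint32_t q1 = pSrc[lane];         // vldrw.u32 q1, [r0], #16
      if (q1 > q0[lane]) {              // vmax.u32 q0, q1, q0
        q0[lane] = q1;
      }
    }

    pSrc += active;                     // Advance to the next group of values
    remaining -= active;                // letp lr, loop
  }

  uint32_t result = 0;                  // movs r0, #0
  for (uint32_t lane = 0; lane < LANES; lane++) {
    if (q0[lane] > result) {            // vmaxv.u32 r0, q0
      result = q0[lane];
    }
  }
  return result;
}
```

With the 18-element input array from max.c, the model runs four full iterations followed by one iteration with two active lanes, and returns 9.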

Next steps

This guide has introduced a number of different techniques for writing Helium-enabled code.

After reading this guide, you will be ready to start writing your own code. The examples supplied with this guide are a good place to start learning. Further examples can be found in the source code of the CMSIS-DSP pack.