Auto-vectorization and Helium

There are many different ways to write code that takes advantage of Helium technology. Writing hand-optimized assembly kernels, or C code containing Helium intrinsics, provides a high level of control over the Helium code in your software. However, these methods can significantly reduce portability and increase engineering complexity.

Often a high-quality compiler can generate code which is just as good, but requires significantly less design time. Auto-vectorization is the process of allowing the compiler to automatically identify opportunities in your code to use Helium instructions.

Auto-vectorization includes the following compilation techniques:

  • Loop vectorization – Unrolling loops to reduce the number of iterations, while performing more operations in each iteration.
  • Superword-Level Parallelism (SLP) vectorization – Bundling scalar operations together to use full width Helium instructions.
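
For example, the following sketch contrasts the two techniques. It is illustrative only, and the function names are hypothetical: a compiler may apply loop vectorization to the counted loop in the first function, and SLP vectorization to the group of independent scalar statements in the second.

#include <stdint.h>

/* Candidate for loop vectorization: the same operation is applied to every
   element, so the compiler can process several elements per iteration. */
void scale_array(int32_t *dst, const int32_t *src, int32_t gain, uint32_t n) {
  for (uint32_t i = 0; i < n; i++)
    dst[i] = src[i] * gain;
}

/* Candidate for SLP vectorization: four independent scalar statements perform
   the same operation on adjacent data, so the compiler can bundle them into a
   single full-width vector operation. */
void add_quad(int32_t *dst, const int32_t *a, const int32_t *b) {
  dst[0] = a[0] + b[0];
  dst[1] = a[1] + b[1];
  dst[2] = a[2] + b[2];
  dst[3] = a[3] + b[3];
}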

Auto-vectorizing compilers for Cortex-M processors include Arm Compiler 6 and LLVM-clang.

The benefits of relying on compiler auto-vectorization include:

  • Programs implemented in high-level languages are portable, provided that they contain no architecture-specific code elements such as inline assembly or intrinsics.
  • Modern compilers can perform advanced optimizations automatically.
  • Targeting a given micro-architecture can be as easy as setting a single compiler option. Optimizing an assembly program requires deep knowledge of the target hardware.

However, auto-vectorization might not be the right choice in all situations:

  • While source code can be architecture-agnostic, it may need to be compiler-specific to get the best code generation.
  • Small changes in a high-level language or the compiler options can result in significant and unpredictable changes in generated code.

Using the compiler to generate Helium instructions is appropriate for most projects. Other methods for exploiting Helium are necessary only when the generated code does not deliver the necessary performance, or when particular hardware features are not supported by high-level languages.

Compiling for Helium with Arm Compiler 6

To enable automatic vectorization, you must specify appropriate compiler options.

These compiler options must do the following:

  • Target a processor that has Helium capabilities
  • Specify an optimization level that includes auto-vectorization

Specifying a Helium-capable target

If you want to run code on one processor, you can target that specific processor with the -mcpu option. Performance is optimized for the micro-architectural specifics of that processor. However, code is only guaranteed to run on that processor.

Alternatively, if you want your code to run on a range of processors, you can target an architecture with the -march option. Generated code runs on any processor implementation of that target architecture, but performance might be impacted.

In both cases, you can use one of the following feature modifiers to enable Helium:

  • +mve enables MVE instructions for integer operations.
  • +mve.fp enables MVE instructions for integer and single-precision floating-point operations.
  • +mve.fp+fp.dp enables MVE instructions for integer, single-precision, and double-precision floating-point operations.

The Helium extension is always enabled on the Cortex-M55, so there is no need to use a feature modifier. Targeting the processor is sufficient to generate Helium code, as in the following command:

armclang --target=arm-arm-none-eabi -mcpu=cortex-m55 ...

To target Helium for any Helium-enabled Armv8.1-M platform, you must specify a feature modifier, as in the following command:

armclang --target=arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp ...

Specifying an auto-vectorizing optimization level

Arm Compiler 6 provides a wide range of optimization levels, selected with the -O option.

The following table defines the available optimization levels:

Option   Description                                                Auto-vectorization
-O0      Minimum optimization                                       Never
-O1      Restricted optimization                                    Disabled by default
-O2      High optimization                                          Enabled by default
-O3      Very high optimization                                     Enabled by default
-Os      Reduce code size, balancing code size against code speed   Enabled by default
-Oz      Smallest possible code size                                Enabled by default
-Ofast   Optimize for high performance beyond -O3                   Enabled by default
-Omax    Optimize for high performance beyond -Ofast                Enabled by default

For more details about these options, see Selecting optimization options in the Arm Compiler User Guide and -O in the Arm Compiler armclang Reference Guide.

Auto-vectorization is enabled by default at optimization level -O2 and higher. The -fno-vectorize option lets you disable auto-vectorization.

At optimization level -O1, auto-vectorization is disabled by default. The -fvectorize option lets you enable auto-vectorization.

At optimization level -O0, auto-vectorization is always disabled. If you specify the -fvectorize option, the compiler ignores it.

To enable auto-vectorization, do one of the following:

  • Select an optimization level of -O2 or higher.
  • Select an optimization level of -O1 and specify -fvectorize.
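
For example, a command of the following form could be used to compile with auto-vectorization at optimization level -O1. The source file name is illustrative only; the options are those described above:

armclang --target=arm-arm-none-eabi -mcpu=cortex-m55 -O1 -fvectorize -c example.c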

Helium auto-vectorization example

The following Helium auto-vectorization example shows how an auto-vectorizing compiler identifies optimization opportunities in source code, and uses Helium instructions to maximize performance.

The example function clips floating-point values that fall outside a specified range. The function takes the following parameters:

  • *pSrc, a pointer to an array of input data
  • *pDst, a pointer to an array where output data will be stored
  • low, the lower bound of the clipping range. Input data values lower than low are replaced with low.
  • high, the upper bound of the clipping range. Input data values higher than high are replaced with high.
  • numSamples, the number of data values in the input array (and therefore also the output array once the function has finished).

The example function is implemented as follows:

#include "arm_math.h"

void arm_clip_f32(float32_t * pSrc, float32_t * pDst, float32_t low, float32_t high,
                      uint32_t numSamples) {
  for (uint32_t i = 0; i < numSamples; i++) {
    if (pSrc[i] > high)
      pDst[i] = high;
    else if (pSrc[i] < low)
      pDst[i] = low;
    else
      pDst[i] = pSrc[i];
  }
}

Compile this code with Arm Compiler 6 as follows:

armclang --target=arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp -Ofast
             -S arm_clip_f32.c

In this example, using the -S option means that the compiler outputs the generated assembly code to the file arm_clip_f32.s.

Examining the arm_clip_f32.s file shows the instructions the compiler has generated:

        ...
        dls          lr, lr
.LBB0_13:                               @ =>This Inner Loop Header: Depth=1
        vldrw.u32    q3, [r3], #16
        vpt.f32      ge, q3, q2
        vcmpt.f32    ge, q1, q3
        vstr         p0, [sp]           @ 4-byte Spill
        vcmp.f32     gt, q2, q3
        vstr         p0, [sp, #4]       @ 4-byte Spill
        vldr         p0, [sp]           @ 4-byte Reload
        vpst
        vstrwt.32    q3, [r2]
        vldr         p0, [sp, #4]       @ 4-byte Reload
        vpstt
        vcmpt.f32    ge, q1, q3
        vstrwt.32    q2, [r2]
        vpt.f32      gt, q3, q1
        vstrwt.32    q1, [r2], #16
        le           lr, .LBB0_13
        ...

The Helium instruction VLDRW.U32 loads the input data into vector lanes. In this example the data values are 32-bit, so each vector load fetches four data values at a time.

The VCMP.F32 Helium instructions then compare those vector lanes concurrently against the upper and lower clipping values.

Helium predication instructions such as VPST ensure that the subsequent store instructions operate only on the lanes where the comparison shows that clipping is needed.

Coding best practice for auto-vectorization

As an implementation becomes more complicated, the likelihood that the compiler can auto-vectorize the code decreases.

For example, loops with the following characteristics are particularly difficult, or impossible, to vectorize:

  • Loops with interdependencies between different loop iterations
  • Loops with break clauses
  • Loops with complex conditions

Arm recommends modifying your source code implementation to eliminate these situations where possible.

For example, a necessary condition for auto-vectorization is that the number of loop iterations must be known when the loop starts. Break conditions mean that the iteration count might not be known at the start of the loop, which prevents auto-vectorization. If it is not possible to avoid a break condition completely, it may be worthwhile to split the loop into separate vectorizable and non-vectorizable parts.
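
The following sketch illustrates this approach. The function names and the data layout are hypothetical; the point is that the search for a terminating value, whose length is unknown, is separated from a counted loop that the compiler can vectorize.

#include <stdint.h>

/* Hard to vectorize: the break means that the iteration count is not known
   when the loop starts. */
void scale_until_zero(int32_t *dst, const int32_t *src, int32_t gain, uint32_t n) {
  for (uint32_t i = 0; i < n; i++) {
    if (src[i] == 0)
      break;
    dst[i] = src[i] * gain;
  }
}

/* Restructured: a scalar loop first finds the length, then a counted loop,
   which the compiler can vectorize, processes that many elements. */
void scale_until_zero_split(int32_t *dst, const int32_t *src, int32_t gain, uint32_t n) {
  uint32_t len = 0;
  while (len < n && src[len] != 0)    /* non-vectorizable search */
    len++;
  for (uint32_t i = 0; i < len; i++)  /* vectorizable counted loop */
    dst[i] = src[i] * gain;
}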

A full discussion of the compiler directives that are used to control vectorization of loops can be found in the LLVM-Clang documentation, but the two most important are:

  • #pragma clang loop vectorize(enable)
  • #pragma clang loop interleave(enable)

These pragmas are hints to the compiler to perform loop vectorization and loop interleaving respectively. They are [COMMUNITY] features of Arm Compiler.
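
For example, the pragmas could be applied to a loop as in the following sketch. The function is hypothetical and is shown only to illustrate where the pragma is placed, immediately before the loop it applies to.

#include <stdint.h>

void add_arrays(int32_t *dst, const int32_t *a, const int32_t *b, uint32_t n) {
  /* Hint to the compiler that this loop should be vectorized and interleaved. */
  #pragma clang loop vectorize(enable) interleave(enable)
  for (uint32_t i = 0; i < n; i++)
    dst[i] = a[i] + b[i];
}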

More detailed guides covering auto-vectorization are available for the Arm C/C++ Compiler Linux user-space compiler, although many of the points apply across LLVM-Clang variants.
