Why rely on the compiler for auto-vectorization?

Writing hand-optimized assembly kernels or C code containing Neon intrinsics provides a high level of control over the Neon code in your software. However, these methods can result in significant portability and engineering complexity costs.

In many cases, a high-quality compiler can generate code that is just as good while requiring significantly less design time. Auto-vectorization is the process by which the compiler automatically identifies opportunities in your code to use Advanced SIMD instructions.

In terms of specific compilation techniques, auto-vectorization includes:

  • Loop vectorization: unrolling loops to reduce the number of iterations, while performing more operations in each iteration.
  • Superword-Level Parallelism (SLP) vectorization: bundling scalar operations together to make use of full-width Advanced SIMD instructions.
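
As a rough illustration of the two techniques, the following sketch shows the kind of C source that each one typically targets. The function names `add_arrays` and `add_four` are hypothetical and chosen only for this example. Whether a given compiler actually vectorizes them depends on the compiler version, the optimization level, and the target.

```c
#include <stddef.h>

/* Loop vectorization candidate: every iteration is independent, so a
   compiler may process several elements per iteration, for example four
   32-bit floats in one 128-bit Advanced SIMD register. The restrict
   qualifiers promise that the arrays do not overlap. */
void add_arrays(float *restrict dst, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}

/* SLP vectorization candidate: four independent scalar operations on
   adjacent elements, which a compiler may bundle into one vector add. */
void add_four(float *restrict dst, const float *restrict a,
              const float *restrict b)
{
    dst[0] = a[0] + b[0];
    dst[1] = a[1] + b[1];
    dst[2] = a[2] + b[2];
    dst[3] = a[3] + b[3];
}
```

With GCC or Clang, building at a higher optimization level such as `-O3` normally enables the vectorizers. Inspecting the generated assembly is the reliable way to confirm that Advanced SIMD instructions were emitted.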

Auto-vectorizing compilers include Arm Compiler 6, Arm C/C++ Compiler, LLVM-clang, and GCC.

The benefits of relying on compiler auto-vectorization include the following:

  • Programs implemented in high-level languages are portable, so long as they contain no architecture-specific code elements such as inline assembly or intrinsics.
  • Modern compilers are capable of performing advanced optimizations automatically.
  • Targeting a given micro-architecture can be as easy as setting a single compiler option, whereas optimizing an assembly program requires deep knowledge of the target hardware.

Auto-vectorization might not be the right choice in all situations, however:

  • While source code can be architecture-agnostic, it may have to be compiler-specific to get the best code generation.
  • Small changes to the high-level source code or to the compiler options can result in significant and unpredictable changes in the generated code.
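
The first limitation can be sketched as follows. This is a minimal example, assuming GCC or Clang, of how vectorization hints end up tied to a particular compiler. The macro `VECTORIZE_HINT` and the function `saxpy` are hypothetical names used only for illustration. The two pragmas do not mean the same thing: the Clang pragma requests vectorization of the loop, while the GCC pragma asserts that the loop has no loop-carried dependencies.

```c
#include <stddef.h>

/* Hypothetical helper macro: expands to a compiler-specific loop hint.
   Clang is tested first because it also defines __GNUC__. */
#if defined(__clang__)
#  define VECTORIZE_HINT _Pragma("clang loop vectorize(enable)")
#elif defined(__GNUC__)
#  define VECTORIZE_HINT _Pragma("GCC ivdep")
#else
#  define VECTORIZE_HINT /* no hint for other compilers */
#endif

/* Computes y[i] += alpha * x[i]. The caller must pass non-overlapping
   arrays, because the GCC hint tells the compiler to assume that. */
void saxpy(float *y, const float *x, float alpha, size_t n)
{
    VECTORIZE_HINT
    for (size_t i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}
```

The loop body itself stays portable C. Only the hint differs between compilers, which is why such code often needs a small layer of conditional macros.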

Using the compiler to generate Neon code is appropriate for most projects. Other methods for exploiting Neon become necessary only when the generated code does not deliver the required performance, or when high-level languages do not support particular hardware features. For example, configuring system registers to control floating-point functionality must be performed in assembly code.
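
For comparison, the following sketch shows what one of those other methods might look like: an element-wise addition written by hand with Neon intrinsics from `arm_neon.h`. The function name `add_arrays_intrinsics` is hypothetical. The `__ARM_NEON` guard and the scalar tail loop are assumptions added so that the file still builds on targets without Neon, which also illustrates the extra engineering effort that intrinsics bring.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

void add_arrays_intrinsics(float *dst, const float *a, const float *b,
                           size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    /* Process four floats per iteration in 128-bit Neon registers. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    /* Scalar loop: handles the leftover elements on Neon targets and
       the whole array on targets without Neon. */
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

The plain C loop that a compiler can auto-vectorize is shorter and carries no target-specific code, so hand-written versions like this one are usually reserved for cases where measurements show the generated code falls short.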
