Compiler optimization

Arm Compiler for Linux can automatically generate code that contains Neon and SVE instructions. Allowing the compiler to automatically identify opportunities in your code to use Neon or SVE instructions is called auto-vectorization.

Auto-vectorization includes the following specific compilation techniques:

  • Loop vectorization: Unrolling loops to reduce the number of iterations, while performing more operations in each iteration
  • Superword-Level Parallelism (SLP) vectorization: Bundling scalar operations together to use full width Advanced SIMD instructions

The benefits of relying on compiler auto-vectorization include the following:

  • Programs implemented in high-level languages are portable, provided that they contain no architecture-specific code elements such as inline assembly or intrinsics.
  • Modern compilers can perform advanced optimizations automatically.
  • Targeting a given micro-architecture can be as easy as setting a single compiler option. However, hand-optimizing a program in assembly requires deep knowledge of the target hardware.

Compile for AArch64 with Arm Compiler for Linux as follows:

Neon

  • Header file: #include <arm_neon.h>
  • Recommended command line: armclang -O<level> -mcpu={native|<target>} -o <binary_name> <filename>.c

To take advantage of micro-architectural optimizations, set -mcpu to the target processor that your application runs on. If the target processor is the same processor that you are compiling your code on, set -mcpu=native. This setting allows the compiler to automatically detect your processor.

-march=armv8-a is also supported, but does not include the micro-architectural optimizations. For more information about architectural compiler flags and data to support this recommendation, see the Compiler flags across architectures: -march, -mtune, and -mcpu blog.

SVE

  • Header file:

      #ifdef __ARM_FEATURE_SVE
      #include <arm_sve.h>
      #endif

  • Recommended command line: armclang -O<level> -march=armv8-a+sve -o <binary_name> <filename>.c

The -march=armv8-a+sve option specifies that the compiler optimizes for Armv8-A hardware with the SVE extension. If you do not have SVE hardware, you can then use Arm Instruction Emulator to run the SVE instructions.

When SVE-enabled hardware is available and you are compiling on that target SVE hardware, Arm recommends using -mcpu=native instead. Using -mcpu=native allows you to take advantage of micro-architectural optimizations.

The following list shows the supported optimization levels for -O<level> for both Neon and SVE code:

  • -O0: Minimum optimization for the performance of the compiled binary. Turns off most optimizations. When debugging is enabled, this option generates code that directly corresponds to the source code, and can therefore result in a significantly larger image. This is the default optimization level. Auto-vectorization: never.
  • -O1: Restricted optimization. When debugging is enabled, this option gives the best debug view for the trade-off between image size, performance, and debug. Auto-vectorization: disabled by default.
  • -O2: High optimization. When debugging is enabled, the debug view might be less satisfactory, because the mapping of object code to source code is not always clear and the compiler might perform optimizations that cannot be described by debug information. Auto-vectorization: enabled by default.
  • -O3: Very high optimization. When debugging is enabled, this option typically gives a poor debug view. Arm recommends debugging at lower optimization levels. Auto-vectorization: enabled by default.
  • -Ofast: Enables all the optimizations from level 3, including those performed with the -ffp-mode=fast armclang option. This level also performs other aggressive optimizations that might violate strict compliance with language standards. Auto-vectorization: enabled by default.

Auto-vectorization is enabled by default at optimization level -O2 and higher. The -fno-vectorize option lets you disable auto-vectorization.

At optimization level -O0, auto-vectorization is always disabled. If you specify the -fvectorize option, the compiler ignores it.

At optimization level -O1, auto-vectorization is disabled by default. The -fvectorize option lets you enable auto-vectorization.

As an implementation becomes more complicated, the likelihood that the compiler can auto-vectorize the code decreases. For example, loops with the following characteristics are particularly difficult, or impossible, to vectorize:

  • Loops with interdependencies between different loop iterations
  • Loops with break clauses
  • Loops with complex conditions

Neon and SVE have different conditions for auto-vectorization. For example, to auto-vectorize code for Neon, the compiler must be able to determine the number of loop iterations at the start of the loop, at execution time. Auto-vectorizing code for SVE does not require the iteration count to be known in advance, because SVE's predicated instructions can handle a partial final vector.

Note: Break conditions mean the loop iteration count might not be knowable at the start of the loop, which prevents auto-vectorization for Neon code. If it is not possible to completely avoid a break condition, consider splitting the loop into separate vectorizable and non-vectorizable parts.

You can find a full discussion of the compiler directives used to control vectorization of loops in the LLVM-Clang documentation. The two most important directives are:

  • #pragma clang loop vectorize(enable)
  • #pragma clang loop interleave(enable)

These pragmas hint to the compiler to perform loop vectorization and loop interleaving, respectively. More detailed information about auto-vectorization is available in the Arm C/C++ Compiler and Arm Fortran Compiler Reference guides.
