Using pragmas to control auto-vectorization

Arm C/C++ Compiler for HPC supports pragmas to both encourage and suppress auto-vectorization. These pragmas make use of, and extend, the pragma clang loop directives.

For more information about the pragma clang loop directives, see Auto-Vectorization in LLVM, at llvm.org.

Note: In all the following cases, the pragma only affects the loop statement immediately following it. If your code contains multiple nested loops, you must insert a pragma before each one in order to affect all the loops in the nest.
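For example, to request vectorization of both loops in a two-level nest, a directive is needed before each for statement. The following sketch uses an illustrative function, scale_matrix, that does not appear elsewhere in this section:

```c
/* Illustrative example: scale every element of a two-dimensional array.
 * A pragma placed before the outer loop only would leave the inner loop
 * unaffected, so each loop in the nest gets its own directive. */
void scale_matrix(int rows, int cols, int m[rows][cols], int factor)
{
  #pragma clang loop vectorize(enable)
  for (int i = 0; i < rows; i++)
  {
    #pragma clang loop vectorize(enable)
    for (int j = 0; j < cols; j++)
    {
      m[i][j] *= factor;
    }
  }
}
```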

Encouraging auto-vectorization with pragmas

If SVE auto-vectorization is enabled with -O2 or above, then by default the compiler examines all loops for vectorization opportunities.

If static analysis of a loop indicates that it might contain dependencies that hinder parallelism, auto-vectorization might not be performed. If you know that these dependencies do not hinder vectorization, you can use the vectorize directive to indicate this to the compiler by placing the following line immediately before the loop:

#pragma clang loop vectorize(assume_safety)

This pragma indicates to the compiler that the following loop contains no data dependencies between loop iterations that would prevent vectorization. The compiler might be able to use this information to vectorize a loop where vectorization would not otherwise be possible.

Note: Use of this pragma does not guarantee auto-vectorization. There might be other reasons why auto-vectorization is not possible or worthwhile for a particular loop.

Ensure that you only use this pragma when it is safe to do so. Using this pragma when there are data dependencies between loop iterations may result in incorrect behavior.

For example, consider the following loop, which processes an array, indices. Each element in indices specifies an index into a larger histogram array. The referenced element of the histogram array is incremented.

void update(int *restrict histogram, int *restrict indices, int count)
{
  for (int i = 0; i < count; i++)
  {
    histogram[ indices[i] ]++;
  }
}

The compiler is unable to vectorize this loop, because the same index could appear more than once in the indices array. Therefore a vectorized version of the algorithm would lose some of the increment operations if two identical indices are processed in the same vector load/increment/store sequence.

However, if the programmer knows that the indices array only ever contains unique elements, it is useful to be able to tell the compiler that vectorizing this loop is safe. This is accomplished by placing the pragma immediately before the loop:

void update_unique(int *restrict histogram, int *restrict indices, int count)
{
  #pragma clang loop vectorize(assume_safety)
  for (int i = 0; i < count; i++)
  {
    histogram[ indices[i] ]++;
  }
}

Suppressing auto-vectorization with pragmas

If SVE auto-vectorization is not required for a specific loop, you can disable it or restrict it to only use Arm SIMD (NEON) instructions.

You can suppress auto-vectorization on a specific loop by adding #pragma clang loop vectorize(disable) immediately before the loop. In this example, the compiler is prevented from vectorizing a loop that it would otherwise trivially vectorize:

void combine_arrays(int *restrict a, int *restrict b, int count)
{
  #pragma clang loop vectorize(disable)
  for ( int i = 0; i < count; i++ )
  {
    a[i] = b[i] + 1;
  }
}

You can also suppress SVE instructions while allowing Arm NEON instructions by adding a vectorize_style hint:

vectorize_style(fixed_width)

Prefer fixed-width vectorization, resulting in Arm NEON instructions. For a loop with vectorize_style(fixed_width), the compiler prefers to generate Arm NEON instructions, though SVE instructions may still be used with a fixed-width predicate (such as gather loads or scatter stores).

vectorize_style(scaled_width)

Prefer scaled-width vectorization, resulting in SVE instructions. For a loop with vectorize_style(scaled_width), the compiler prefers SVE instructions but can choose to generate Arm NEON instructions or not vectorize at all. This is the default.

For example:

void combine_arrays(int *restrict a, int *restrict b, int count)
{
  #pragma clang loop vectorize(enable) vectorize_style(fixed_width)
  for ( int i = 0; i < count; i++ )
  {
    a[i] = b[i] + 1;
  }
}
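Similarly, the default scaled-width style can be requested explicitly. Because scaled_width is already the default, the pragma here only documents the intent; this sketch reuses the same loop under a different illustrative name:

```c
void combine_arrays_sve(int *restrict a, int *restrict b, int count)
{
  /* Prefer scalable (SVE) vectorization for this loop. */
  #pragma clang loop vectorize(enable) vectorize_style(scaled_width)
  for ( int i = 0; i < count; i++ )
  {
    a[i] = b[i] + 1;
  }
}
```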

Unrolling and interleaving with pragmas

To enable better use of processor resources, the body of a loop can be duplicated to reduce the loop iteration count and increase instruction-level parallelism (ILP). For scalar loops, this method is called unrolling. For vectorizable loops, interleaving is performed instead.

Unrolling

Unrolling a scalar loop, for example:

for (int i = 0; i < 64; i++) {
  data[i] = input[i] * other[i];
}

by a factor of two, gives:

for (int i = 0; i < 64; i += 2) {
  data[i] = input[i] * other[i];
  data[i+1] = input[i+1] * other[i+1];
}

For this example, two is the unrolling factor (UF). To unroll to the internal limit, the following pragma is inserted before the loop:

#pragma clang loop unroll(enable)

To unroll to a user-defined UF, instead insert:

#pragma clang loop unroll_count(value)
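For example, the following sketch (multiply_arrays is an illustrative name) requests a UF of four for the loop from the unrolling example:

```c
void multiply_arrays(int *restrict data, int *restrict input,
                     int *restrict other, int count)
{
  /* Request that the compiler unroll this loop by a factor of 4. */
  #pragma clang loop unroll_count(4)
  for (int i = 0; i < count; i++)
  {
    data[i] = input[i] * other[i];
  }
}
```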

Interleaving

To interleave, an interleaving factor (IF) is used instead of a UF. To generate efficient interleaved code, the loop vectorizer models the cost based on register pressure and the generated code size. When a loop is vectorized, interleaved code can be more efficient than unrolled code.

Like the UF, the IF can be the internal limit or a user-defined integer. To interleave to the internal limit, the following pragma is inserted before the loop:

#pragma clang loop interleave(enable)

To interleave to a user-defined IF, instead insert:

#pragma clang loop interleave_count(value)
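For example, the following sketch (sum_arrays is an illustrative name) requests that the compiler vectorize a loop and interleave the vectorized body with an IF of two:

```c
void sum_arrays(int *restrict out, int *restrict a,
                int *restrict b, int count)
{
  /* Vectorize the loop, and interleave the vectorized body twice
   * to increase instruction-level parallelism. */
  #pragma clang loop vectorize(enable) interleave_count(2)
  for (int i = 0; i < count; i++)
  {
    out[i] = a[i] + b[i];
  }
}
```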

Note: Interleaving performed on a scalar loop will not unroll the loop correctly.