
Use pragmas to control auto-vectorization

Arm® C/C++ Compiler supports pragmas to both encourage and suppress auto-vectorization. These pragmas make use of, and extend, the pragma clang loop directives.

For more information about the pragma clang loop directives, see .

Note

In each of the following examples, the pragma only affects the loop statement immediately following it. If your code contains multiple nested loops, you must insert a pragma before each one in order to affect all the loops in the nest.
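As a minimal sketch of the note above, the following hypothetical function (increment_matrix is not from the original documentation) suppresses auto-vectorization for both loops of a nest by repeating the pragma before each loop:

```c
#include <assert.h>

/* Hypothetical example: add 1 to every element of a rows x cols matrix
   stored in row-major order. */
void increment_matrix(int *restrict m, int rows, int cols)
{
  assert(rows >= 0 && cols >= 0);

  /* This pragma affects only the outer loop. */
  #pragma clang loop vectorize(disable)
  for (int r = 0; r < rows; r++)
  {
    /* The inner loop needs its own pragma. */
    #pragma clang loop vectorize(disable)
    for (int c = 0; c < cols; c++)
    {
      m[r * cols + c] += 1;
    }
  }
}
```

Removing the second pragma would leave the inner loop eligible for auto-vectorization.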

Encourage auto-vectorization with pragmas

When auto-vectorization is enabled (optimization level -O2 or higher), the compiler examines all loops by default.

If static analysis of a loop indicates that it might contain dependencies that hinder parallelism, auto-vectorization might not be performed. If you know that these dependencies do not hinder vectorization, use the vectorize pragma to inform the compiler.

To use the vectorize pragma, insert the following line immediately before the loop:

#pragma clang loop vectorize(assume_safety)

The pragma above indicates to the compiler that the following loop contains no data dependencies between loop iterations that would prevent vectorization. The compiler might be able to use this information to vectorize a loop, where it would not typically be possible.

Note

The vectorize pragma does not guarantee auto-vectorization. There might be other reasons why auto-vectorization is not possible or worthwhile for a particular loop.

Warning

Ensure that you only use this pragma when it is safe to do so. Using the vectorize pragma when there are data dependencies between loop iterations might result in incorrect behavior.

For example, consider the following loop, which processes an array named indices. Each element in indices specifies an index into a larger histogram array. The referenced element in the histogram array is incremented.

void update(int *restrict histogram, int *restrict indices, int count)
{
  for (int i = 0; i < count; i++)
  {
    histogram[ indices[i] ]++;
  }
}

The compiler is unable to vectorize this loop, because the same index could appear more than once in the indices array. Therefore, a vectorized version of the algorithm would lose some of the increment operations if two identical indices are processed in the same vector load/increment/store sequence.

However, if you know that the indices array only ever contains unique elements, you can tell the compiler that the loop is safe to vectorize. To do this, place the vectorize pragma before the loop:

void update_unique(int *restrict histogram, int *restrict indices, int count)
{
  #pragma clang loop vectorize(assume_safety)
  for (int i = 0; i < count; i++)
  {
    histogram[ indices[i] ]++;
  }
}
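Because the pragma is only safe when the indices really are unique, one option is to verify that assumption in debug builds. The following is a sketch under stated assumptions: indices_are_unique, update_unique_checked, and the bins parameter (the number of elements in histogram) are hypothetical names introduced here, not part of the compiler or the original example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: returns true if every value in indices[0..count)
   lies in [0, bins) and appears at most once. */
bool indices_are_unique(const int *indices, int count, int bins)
{
  if (count <= 0) return count == 0;
  if (bins <= 0) return false;

  unsigned char *seen = calloc((size_t)bins, 1);
  if (seen == NULL) return false; /* Be conservative if allocation fails. */

  bool unique = true;
  for (int i = 0; i < count && unique; i++)
  {
    int idx = indices[i];
    if (idx < 0 || idx >= bins || seen[idx])
      unique = false;
    else
      seen[idx] = 1;
  }
  free(seen);
  return unique;
}

/* Hypothetical wrapper: checks the uniqueness assumption with assert()
   (compiled out when NDEBUG is defined) before running the loop. */
void update_unique_checked(int *restrict histogram, int *restrict indices,
                           int count, int bins)
{
  assert(indices_are_unique(indices, count, bins));

  #pragma clang loop vectorize(assume_safety)
  for (int i = 0; i < count; i++)
  {
    histogram[ indices[i] ]++;
  }
}
```

The check costs an extra pass over indices, so it is intended for debug builds only.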

Suppress auto-vectorization with pragmas

If auto-vectorization is not required for a specific loop, you can disable it or restrict it to only use Arm SIMD (Neon) instructions.

To suppress auto-vectorization on a specific loop, add #pragma clang loop vectorize(disable) immediately before the loop.

In this example, the compiler would normally vectorize the loop trivially, but the pragma prevents it from doing so:

void combine_arrays(int *restrict a, int *restrict b, int count)
{
  #pragma clang loop vectorize(disable)
  for ( int i = 0; i < count; i++ )
  {
    a[i] = b[i] + 1;
  }
}

You can also suppress SVE instructions while allowing Arm Neon instructions by adding a vectorize_style hint:

vectorize_style(fixed_width)

Prefer fixed-width vectorization, resulting in Arm Neon instructions. For a loop with vectorize_style(fixed_width), the compiler prefers to generate Arm Neon instructions, though SVE instructions might still be used with a fixed-width predicate (such as gather loads or scatter stores).

vectorize_style(scaled_width) (default)

Prefer scaled-width vectorization, resulting in SVE instructions. For a loop with vectorize_style(scaled_width), the compiler prefers SVE instructions but can choose to generate Arm Neon instructions or not vectorize at all.

For example:

void combine_arrays(int *restrict a, int *restrict b, int count)
{
  #pragma clang loop vectorize(enable) vectorize_style(fixed_width)
  for ( int i = 0; i < count; i++ )
  {
    a[i] = b[i] + 1;
  }
}

Unrolling and interleaving with pragmas

To make better use of processor resources, the body of a loop can be duplicated to reduce the loop iteration count and increase Instruction-Level Parallelism (ILP). For scalar loops, this technique is called unrolling. For vectorizable loops, interleaving is performed instead.

Unrolling

Unrolling a scalar loop, for example:

for (int i = 0; i < 64; i++) {
  data[i] = input[i] * other[i];
}

by a factor of two, gives:

for (int i = 0; i < 64; i += 2) {
  data[i] = input[i] * other[i];
  data[i+1] = input[i+1] * other[i+1];
}

For the example above, the unrolling factor (UF) is two. To unroll to the internal limit, the unroll pragma is inserted before the loop:

#pragma clang loop unroll(enable)

To unroll to a user-defined UF, instead insert:

#pragma clang loop unroll_count(_value_)
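As an illustration, the following sketch requests an unrolling factor of four for a loop like the one in the unrolling example. The function name multiply_arrays and its parameters are hypothetical. The pragma is a hint to the optimizer and does not change the values the loop computes.

```c
#include <assert.h>

/* Hypothetical example: element-wise product of two arrays. */
void multiply_arrays(int *restrict data, const int *restrict input,
                     const int *restrict other, int count)
{
  assert(count >= 0);

  /* Ask the compiler to unroll this loop by a factor of four. */
  #pragma clang loop unroll_count(4)
  for (int i = 0; i < count; i++)
  {
    data[i] = input[i] * other[i];
  }
}
```

The source loop stays in its simple form, so it remains correct for any count, including values that are not a multiple of four.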

Interleaving

To interleave, an Interleaving Factor (IF) is used instead of a UF. To choose how much to interleave, the loop vectorizer uses a cost model based on register pressure and generated code size. When a loop is vectorized, interleaved code can be more efficient than unrolled code.

Like the UF, the IF can be the internal limit or a user-defined integer. To interleave to the internal limit, the interleave pragma is inserted before the loop:

#pragma clang loop interleave(enable)

To interleave to a user-defined IF, instead insert:

#pragma clang loop interleave_count(_value_)
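As a sketch, the following hypothetical function (scale_array is not from the original documentation) combines vectorize(enable) with an interleaving factor of two, on the assumption that the loop is one the compiler can vectorize:

```c
#include <assert.h>

/* Hypothetical example: multiply every element of an array by a factor. */
void scale_array(float *restrict out, const float *restrict in,
                 float factor, int count)
{
  assert(count >= 0);

  /* Request vectorization with an interleaving factor of two. */
  #pragma clang loop vectorize(enable) interleave_count(2)
  for (int i = 0; i < count; i++)
  {
    out[i] = in[i] * factor;
  }
}
```

As with the other pragmas, this is a hint, and the compiler might still choose a different strategy.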

Note

Interleaving applies to vectorized loops only. Interleaving performed on a scalar loop does not unroll the loop correctly.
