You copied the Doc URL to your clipboard.

Optimizing loops

Loops can take a significant amount of time to complete depending on the number of iterations in the loop. The overhead of checking a condition for each iteration of the loop can degrade the performance of the loop.

Loop unrolling

You can reduce the impact of this overhead by unrolling some of the iterations, which in turn reduces the number of iterations for checking the condition. Use #pragma unroll (n) to unroll time-critical loops in your source code. However, unrolling loops has the disadvantage of increasing the code size. These pragmas are only effective at optimization -O2, -O3, -Ofast, and -Omax.

Table 3-3 Loop unrolling pragmas

Pragma Description
#pragma unroll (n) Unroll n iterations of the loop.
#pragma unroll_completely Unroll all the iterations of the loop.

Note

Manually unrolling loops in source code might hinder the automatic rerolling of loops and other loop optimizations by the compiler. Arm recommends that you use #pragma unroll instead of manually unrolling loops. See #pragma unroll[(n)], #pragma unroll_completely in the Arm® Compiler Reference Guide for more information.

The following examples show code with loop unrolling and code without loop unrolling.

Table 3-4 Loop optimizing example

Bit counting loop without unrolling Bit counting loop with unrolling
int countSetBits1(unsigned int n)
{
    int bits = 0;

    while (n != 0)
    {
        if (n & 1) bits++;
        n >>= 1;
    }
    return bits;
}
int countSetBits2(unsigned int n)
{
    int bits = 0;
    #pragma unroll (4) 
    while (n != 0)
    {
        if (n & 1) bits++;
        n >>= 1;
    }
    return bits;
}

The following code is the code that Arm Compiler generates for the preceding examples. Copy the examples into file.c and compile using:

armclang --target=arm-arm-none-eabi -march=armv8-a file.c -O2 -c -S -o file.s

For the function with loop unrolling, countSetBits2, the generated code is faster but larger in size.

Table 3-5 Loop examples

Bit counting loop without unrolling Bit counting loop with unrolling
countSetBits1:       
        mov     r1, r0
        mov     r0, #0
        cmp     r1, #0
        bxeq    lr
        mov     r2, #0
        mov     r0, #0
.LBB0_1:                                
        and     r3, r1, #1
        cmp     r2, r1, asr #1
        add     r0, r0, r3
        lsr     r3, r1, #1
        mov     r1, r3
        bne     .LBB0_1
        bx      lr
countSetBits2:                              
        mov     r1, r0
        mov     r0, #0
        cmp     r1, #0
        bxeq    lr
        mov     r2, #0
        mov     r0, #0
LBB0_1:                                
        and     r3, r1, #1
        cmp     r2, r1, asr #1
        add     r0, r0, r3
        beq     .LBB0_4
@ BB#2:                                 
        asr     r3, r1, #1
        cmp     r2, r1, asr #2
        and     r3, r3, #1
        add     r0, r0, r3
        asrne   r3, r1, #2
        andne   r3, r3, #1
        addne   r0, r0, r3
        cmpne   r2, r1, asr #3
        beq     .LBB0_4
@ BB#3:                                
        asr     r3, r1, #3
        cmp     r2, r1, asr #4
        and     r3, r3, #1
        add     r0, r0, r3
        asr     r3, r1, #4
        mov     r1, r3
        bne     .LBB0_1
.LBB0_4:
        bx	lr

Arm Compiler can unroll loops completely only if the number of iterations is known at compile time.

Loop vectorization

If your target has the Advanced SIMD unit, then Arm Compiler can use the vectorizing engine to optimize vectorizable sections of the code. At optimization level -O1, you can enable vectorization using -fvectorize. At higher optimizations, -fvectorize is enabled by default and you can disable it using -fno-vectorize. See -fvectorize in the Arm® Compiler Reference Guide for more information. When using -fvectorize with -O1, vectorization might be inhibited in the absence of other optimizations which might be present at -O2 or higher.

For example, loops that access structures can be vectorized if all parts of the structure are accessed within the same loop rather than in separate loops. The following examples show a loop that Advanced SIMD can vectorize, and a loop that cannot be vectorized easily.

Table 3-6 Example loops

Vectorizable by Advanced SIMD Not vectorizable by Advanced SIMD
typedef struct tBuffer {
  int a;
  int b;
  int c;
} tBuffer;
tBuffer buffer[8];

void DoubleBuffer1 (void)
{
  int i;
  for (i=0; i<8; i++)
  {
    buffer[i].a *= 2;
    buffer[i].b *= 2;
    buffer[i].c *= 2;
  }
}
typedef struct tBuffer {
  int a;
  int b;
  int c;
} tBuffer;
tBuffer buffer[8];

void DoubleBuffer2 (void)
{
  int i;
  for (i=0; i<8; i++)
    buffer[i].a *= 2;
  for (i=0; i<8; i++)
    buffer[i].b *= 2; 
  for (i=0; i<8; i++)
    buffer[i].c *= 2;
}

For each example, copy the code into file.c and compile at optimization level O2 to enable auto-vectorization:

armclang --target=arm-arm-none-eabi -march=armv8-a -O2 file.c -c -S -o file.s

The vectorized assembly code contains the Advanced SIMD instructions, for example vld1, vshl, and vst1. These Advanced SIMD instructions are not generated when compiling the example with the non-vectorizable loop.

Table 3-7 Assembly code from vectorizable and non-vectorizable loops

Vectorized assembly code Non-vectorized assembly code
DoubleBuffer1:
.fnstart
@ BB#0:
        movw    r0, :lower16:buffer
        movt    r0, :upper16:buffer
        vld1.64 {d16, d17}, [r0:128]
        mov     r1, r0
        vshl.i32        q8, q8, #1
        vst1.32 {d16, d17}, [r1:128]!
        vld1.64 {d16, d17}, [r1:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r1:128]
        add     r1, r0, #32
        vld1.64 {d16, d17}, [r1:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r1:128]
        add     r1, r0, #48
        vld1.64 {d16, d17}, [r1:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r1:128]
        add     r1, r0, #64
        add     r0, r0, #80
        vld1.64 {d16, d17}, [r1:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r1:128]
        vld1.64 {d16, d17}, [r0:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r0:128]
        bxlr
DoubleBuffer2:
	.fnstart
@ BB#0:
       	movw    r0, :lower16:buffer
        movt    r0, :upper16:buffer
        ldr     r1, [r0]
        lsl     r1, r1, #1
        str     r1, [r0]
        ldr     r1, [r0, #12]
        lsl     r1, r1, #1
        str     r1, [r0, #12]
        ldr     r1, [r0, #24]
        lsl     r1, r1, #1
        str     r1, [r0, #24]
        ldr     r1, [r0, #36]
        lsl     r1, r1, #1
        str     r1, [r0, #36]
        ldr     r1, [r0, #48]
        lsl     r1, r1, #1
        str     r1, [r0, #48]
        ldr     r1, [r0, #60]
        lsl     r1, r1, #1
        str     r1, [r0, #60]
        ldr     r1, [r0, #72]
        lsl     r1, r1, #1
        str     r1, [r0, #72]
        ldr     r1, [r0, #84]
        lsl     r1, r1, #1
        str     r1, [r0, #84]
        ldr     r1, [r0, #4]
        lsl     r1, r1, #1
        str     r1, [r0, #4]
        ldr     r1, [r0, #16]
        lsl     r1, r1, #1
        ...
        bx	lr

Note

Advanced SIMD (Single Instruction Multiple Data), also known as Arm Neon™ technology, is a powerful vectorizing unit on Armv7‑A and later Application profile architectures. It enables you to write highly optimized code. You can use intrinsics to directly use the Advanced SIMD capabilities from C or C++ code. The intrinsics and their data types are defined in arm_neon.h. For more information on Advanced SIMD, see the Arm® C Language Extensions, Cortex®‑A Series Programmer's Guide, and Arm® Neon™ Programmer's Guide.

Using -fno-vectorize does not necessarily prevent the compiler from emitting Advanced SIMD instructions. The compiler or linker might still introduce Advanced SIMD instructions, such as when linking libraries that contain these instructions.

To prevent the compiler from emitting Advanced SIMD instructions for AArch64 targets, specify +nosimd using -march or -mcpu:

armclang --target=aarch64-arm-none-eabi -march=armv8-a+nosimd -O2 file.c -c -S -o file.s

To prevent the compiler from emitting Advanced SIMD instructions for AArch32 targets, set the option -mfpu to the correct value that does not include Advanced SIMD. For example, set -mfpu=fp-armv8.

armclang --target=aarch32-arm-none-eabi -march=armv8-a -mfpu=fp-armv8 -O2 file.c -c -S -o file.s

Loop termination in C code

If written without caution, the loop termination condition can cause significant overhead. Where possible:

  • Use simple termination conditions.

  • Write count-down-to-zero loops and test for equality against zero.

  • Use counters of type unsigned int.

Following any or all of these guidelines, separately or in combination, is likely to result in better code.

The following table shows two sample implementations of a routine to calculate n! that together illustrate loop termination overhead. The first implementation calculates n! using an incrementing loop, while the second routine calculates n! using a decrementing loop.

Table 3-8 C code for incrementing and decrementing loops

Incrementing loop Decrementing loop
int fact1(int n)
{
    int i, fact = 1;
    for (i = 1; i <= n; i++)
        fact *= i;
    return (fact);
}
int fact2(int n)
{
    unsigned int i, fact = 1;
    for (i = n; i != 0; i--)
        fact *= i;
    return (fact);
}

The following table shows the corresponding disassembly for each of the preceding sample implementations. Generate the disassembly using:

armclang -Os -S --target=arm-arm-none-eabi -march=armv8-a

Table 3-9 C disassembly for incrementing and decrementing loops

Incrementing loop Decrementing loop
fact1:                                  
        mov     r1, r0
        mov     r0, #1
        cmp     r1, #1
        bxlt    lr
        mov     r2, #0
.LBB0_1:                                
        add     r2, r2, #1
        mul     r0, r0, r2
        cmp     r1, r2
        bne     .LBB0_1
        bx      lr
fact2:                                  
        mov     r1, r0
        mov     r0, #1
        cmp     r1, #0
        bxeq    lr
.LBB1_1:                                
        mul     r0, r0, r1
        subs    r1, r1, #1
        bne     .LBB1_1
        bx      lr

Comparing the disassemblies shows that the ADD and CMP instruction pair in the incrementing loop disassembly has been replaced with a single SUBS instruction in the decrementing loop disassembly. Because the SUBS instruction updates the status flags, including the Z flag, there is no requirement for an explicit CMP r1,r2 instruction.

Also, the variable n does not have to be available for the lifetime of the loop, reducing the number of registers that have to be maintained. Having fewer registers to maintain eases register allocation. If the original termination condition involves a function call, each iteration of the loop might call the function, even if the value it returns remains constant. In this case, counting down to zero is even more important. For example:

for (...; i < get_limit(); ...);

The technique of initializing the loop counter to the number of iterations that are required, and then decrementing down to zero, also applies to while and do statements.

Infinite loops

armclang considers infinite loops with no side-effects to be undefined behavior, as stated in the C11 and C++11 standards. In certain situations armclang deletes or moves infinite loops, resulting in a program that eventually terminates, or does not behave as expected.

To ensure that a loop executes for an infinite length of time, Arm recommends writing infinite loops in the following way:

void infinite_loop(void) {
  while (1)
    asm volatile("");	// this line is considered to have side-effects
}

armclang does not delete or move the loop, because it has side-effects.

Was this page helpful? Yes No