You copied the Doc URL to your clipboard.

Optimizing loops

Loops can take a significant amount of time to complete depending on the number of iterations in the loop. The overhead of checking a condition for each iteration of the loop can degrade the performance of the loop.

Loop unrolling

You can reduce the impact of this overhead by unrolling some of the iterations, which in turn reduces the number of iterations for checking the condition. Use #pragma unroll (n) to unroll time-critical loops in your source code. However, unrolling loops has the disadvantage of increasing the codes size. These pragmas are only effective at optimization -O2, -O3, -Ofast, and -Omax.

Table 3-1 Loop unrolling pragmas

Pragma Description
#pragma unroll (n) Unroll n iterations of the loop.
#pragma unroll_completely Unroll all the iterations of the loop.

The examples below show code with loop unrolling and code without loop unrolling.

Table 3-2 Loop optimizing example

Bit counting loop without unrolling Bit counting loop with unrolling
int countSetBits1(unsigned int n)
{
    int bits = 0;

    while (n != 0)
    {
        if (n & 1) bits++;
        n >>= 1;
    }
    return bits;
}
int countSetBits2(unsigned int n)
{
    int bits = 0;
    #pragma unroll (4) 
    while (n != 0)
    {
        if (n & 1) bits++;
        n >>= 1;
    }
    return bits;
}

The code below shows the code that Arm® Compiler generates for the above examples. Copy the examples above into file.c and compile using:

armclang --target=arm-arm-none-eabi -march=armv8-a file.c -O2 -c -S -o file.s

For the function with loop unrolling, countSetBits2, the generated code is faster but larger in size.

Table 3-3 Loop examples

Bit counting loop without unrolling Bit counting loop with unrolling
countSetBits1:       
        mov     r1, r0
        mov     r0, #0
        cmp     r1, #0
        bxeq    lr
        mov     r2, #0
        mov     r0, #0
.LBB0_1:                                
        and     r3, r1, #1
        cmp     r2, r1, asr #1
        add     r0, r0, r3
        lsr     r3, r1, #1
        mov     r1, r3
        bne     .LBB0_1
        bx      lr
countSetBits2:                              
        mov     r1, r0
        mov     r0, #0
        cmp     r1, #0
        bxeq    lr
        mov     r2, #0
        mov     r0, #0
LBB0_1:                                
        and     r3, r1, #1
        cmp     r2, r1, asr #1
        add     r0, r0, r3
        beq     .LBB0_4
@ BB#2:                                 
        asr     r3, r1, #1
        cmp     r2, r1, asr #2
        and     r3, r3, #1
        add     r0, r0, r3
        asrne   r3, r1, #2
        andne   r3, r3, #1
        addne   r0, r0, r3
        cmpne   r2, r1, asr #3
        beq     .LBB0_4
@ BB#3:                                
        asr     r3, r1, #3
        cmp     r2, r1, asr #4
        and     r3, r3, #1
        add     r0, r0, r3
        asr     r3, r1, #4
        mov     r1, r3
        bne     .LBB0_1
.LBB0_4:
        bx	lr

Arm Compiler can unroll loops completely only if the number of iterations is known at compile time.

Loop vectorization

If your target has the Advanced SIMD unit, then Arm Compiler can use the vectorizing engine to optimize vectorizable sections of the code. At optimization level -O1, you can enable vectorization using -fvectorize. At higher optimizations, -fvectorize is enabled by default and you can disable it using -fno-vectorize. See -fvectorize in the armclang Reference Guide for more information. When using -fvectorize with -O1, vectorization might be inhibited in the absence of other optimizations which might be present at -O2 or higher.

For example, loops that access structures can be vectorized if all parts of the structure are accessed within the same loop rather than in separate loops. The following examples show code with a loop that can be vectorized by Advanced SIMD, and a loop that cannot be vectorized easily.

Table 3-4 Example loops

Vectorizable by Advanced SIMD Not vectorizable by Advanced SIMD
typedef struct tBuffer {
  int a;
  int b;
  int c;
} tBuffer;
tBuffer buffer[8];

void DoubleBuffer1 (void)
{
  int i;
  for (i=0; i<8; i++)
  {
    buffer[i].a *= 2;
    buffer[i].b *= 2;
    buffer[i].c *= 2;
  }
}
typedef struct tBuffer {
  int a;
  int b;
  int c;
} tBuffer;
tBuffer buffer[8];

void DoubleBuffer2 (void)
{
  int i;
  for (i=0; i<8; i++)
    buffer[i].a *= 2;
  for (i=0; i<8; i++)
    buffer[i].b *= 2; 
  for (i=0; i<8; i++)
    buffer[i].c *= 2;
}

For each example above, copy the code into file.c and compile at optimization level O2 to enable auto-vectorization:

armclang --target=arm-arm-none-eabi -march=armv8-a -O2 file.c -c -S -o file.s

The vectorized assembly code contains the Advanced SIMD instructions, for example vld1, vshl, and vst1. These Advanced SIMD instructions are not generated when compiling the example with the non-vectorizable loop.

Table 3-5 Assembly code from vectorizable and non-vectorizable loops

Vectorized assembly code Non-vectorized assembly code
DoubleBuffer1:
.fnstart
@ BB#0:
        movw    r0, :lower16:buffer
        movt    r0, :upper16:buffer
        vld1.64 {d16, d17}, [r0:128]
        mov     r1, r0
        vshl.i32        q8, q8, #1
        vst1.32 {d16, d17}, [r1:128]!
        vld1.64 {d16, d17}, [r1:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r1:128]
        add     r1, r0, #32
        vld1.64 {d16, d17}, [r1:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r1:128]
        add     r1, r0, #48
        vld1.64 {d16, d17}, [r1:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r1:128]
        add     r1, r0, #64
        add     r0, r0, #80
        vld1.64 {d16, d17}, [r1:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r1:128]
        vld1.64 {d16, d17}, [r0:128]
        vshl.i32        q8, q8, #1
        vst1.64 {d16, d17}, [r0:128]
        bxlr
DoubleBuffer2:
	.fnstart
@ BB#0:
       	movw    r0, :lower16:buffer
        movt    r0, :upper16:buffer
        ldr     r1, [r0]
        lsl     r1, r1, #1
        str     r1, [r0]
        ldr     r1, [r0, #12]
        lsl     r1, r1, #1
        str     r1, [r0, #12]
        ldr     r1, [r0, #24]
        lsl     r1, r1, #1
        str     r1, [r0, #24]
        ldr     r1, [r0, #36]
        lsl     r1, r1, #1
        str     r1, [r0, #36]
        ldr     r1, [r0, #48]
        lsl     r1, r1, #1
        str     r1, [r0, #48]
        ldr     r1, [r0, #60]
        lsl     r1, r1, #1
        str     r1, [r0, #60]
        ldr     r1, [r0, #72]
        lsl     r1, r1, #1
        str     r1, [r0, #72]
        ldr     r1, [r0, #84]
        lsl     r1, r1, #1
        str     r1, [r0, #84]
        ldr     r1, [r0, #4]
        lsl     r1, r1, #1
        str     r1, [r0, #4]
        ldr     r1, [r0, #16]
        lsl     r1, r1, #1
        ...
        bx	lr

Note

Advanced SIMD (Single Instruction Multiple Data), also known as Arm NEON™ technology, is a powerful vectorizing unit on Armv7‑A and later Application profile architectures. It enables you to write highly optimized code. You can use intrinsics to directly use the Advanced SIMD capabilities from C or C++ code. The intrinsics and their data types are defined in arm_neon.h. For more information on Advanced SIMD, see the Arm® C Language Extensions, Cortex®‑A Series Programmer's Guide, and Arm® NEON™ Programmer's Guide.

Using -fno-vectorize does not necessarily prevent the compiler from emitting Advanced SIMD instructions. The compiler or linker might still introduce Advanced SIMD instructions, such as when linking libraries that contain these instructions.

To prevent the compiler from emitting Advanced SIMD instructions for AArch64 targets, specify +nosimd using -march or -mcpu:

armclang --target=aarch64-arm-none-eabi -march=armv8-a+nosimd -O2 file.c -c -S -o file.s

To prevent the compiler from emitting Advanced SIMD instructions for AArch32 targets, set the option -mfpu to the correct value that does not include Advanced SIMD, for example fp-armv8.

armclang --target=aarch32-arm-none-eabi -march=armv8-a -mfpu=fp-armv8 -O2 file.c -c -S -o file.s
Was this page helpful? Yes No