
Generate assembly code from C and C++ code

Arm® C/C++ Compiler can produce annotated assembly code. Examining this output is a good first step toward understanding how the compiler vectorizes loops.

Note

Different compiler options are required to make use of SVE functionality. If you are using SVE, please refer to Compiling C/C++ code for Arm SVE architectures.

Example

The following C program subtracts corresponding elements in two arrays, writing the result to a third array. The three pointer parameters of the function are declared using the restrict keyword, indicating to the compiler that the memory they point to does not overlap.

// example1.c
#define ARRAYSIZE 1024
int a[ARRAYSIZE];
int b[ARRAYSIZE];
int c[ARRAYSIZE];
void subtract_arrays(int *restrict a, int *restrict b, int *restrict c)
{
    for (int i = 0; i < ARRAYSIZE; i++)
    {
        a[i] = b[i] - c[i];
    }
}
int main()
{
    subtract_arrays(a, b, c);
}

Compile the program as follows:

armclang -O1 -S -o example1.s example1.c

The -S flag tells the compiler to output assembly code, and the -o option saves that output as example1.s. The section of the generated assembly language file containing the compiled subtract_arrays function appears as follows:

subtract_arrays:                        // @subtract_arrays
// BB#0:
        mov     x8, xzr
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        ldr     w9, [x1, x8]
        ldr     w10, [x2, x8]
        sub     w9, w9, w10
        str     w9, [x0, x8]
        add     x8, x8, #4              // =4
        cmp     x8, #1, lsl #12         // =4096
        b.ne    .LBB0_1
// BB#2:
        ret

This code shows that the compiler has not performed any vectorization, because we specified the -O1 (low optimization) option. Array elements are processed one at a time. Each array element is a 32-bit (4-byte) integer, so the byte offset held in x8 increases by 4 on each iteration. The loop stops when the offset reaches 4096, the end of the array (1024 elements * 4 bytes).
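The structure of the -O1 loop can be mirrored in C to make the offset arithmetic concrete. The following is a minimal sketch, not compiler output: the function name subtract_arrays_scalar and its iteration counter are hypothetical, and the code assumes a 4-byte int, as on AArch64.

```c
#include <stddef.h>
#include <string.h>

#define ARRAYSIZE 1024

/* Hypothetical helper that mirrors the -O1 loop: 'offset' plays the role
 * of x8, a byte offset that advances by one element per iteration.
 * Assumes sizeof(int) == 4, so the loop ends at offset 4096. */
size_t subtract_arrays_scalar(int *restrict a, const int *restrict b,
                              const int *restrict c)
{
    size_t iterations = 0;

    for (size_t offset = 0; offset != ARRAYSIZE * sizeof(int);
         offset += sizeof(int))
    {
        int lhs, rhs, diff;

        memcpy(&lhs, (const char *)b + offset, sizeof lhs); /* ldr w9,  [x1, x8] */
        memcpy(&rhs, (const char *)c + offset, sizeof rhs); /* ldr w10, [x2, x8] */
        diff = lhs - rhs;                                   /* sub w9, w9, w10   */
        memcpy((char *)a + offset, &diff, sizeof diff);     /* str w9,  [x0, x8] */
        iterations++;
    }
    return iterations; /* one iteration per array element */
}
```

Calling this function on three 1024-element arrays performs one load pair, one subtraction, and one store per element, which is the pattern visible in the listing above.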

Enable auto-vectorization

To enable auto-vectorization, increase the optimization level using the -Olevel option. The -O0 option is the lowest optimization level, while -O3 is the highest. Arm C/C++ Compiler only performs auto-vectorization at -O2 and higher:

armclang -O2 -S -o example1.s example1.c

The output assembly code is saved as example1.s. The section of the generated assembly language file containing the compiled subtract_arrays function appears as follows:

subtract_arrays:                        // @subtract_arrays
// BB#0:
        mov     x8, xzr
        add     x9, x0, #16             // =16
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        add     x10, x1, x8
        add     x11, x2, x8
        ldp     q0, q1, [x10]
        ldp     q2, q3, [x11]
        add     x10, x9, x8
        add     x8, x8, #32             // =32
        cmp     x8, #1, lsl #12         // =4096
        sub     v0.4s, v0.4s, v2.4s
        sub     v1.4s, v1.4s, v3.4s
        stp     q0, q1, [x10, #-16]
        b.ne    .LBB0_1
// BB#2:
        ret

This time, Arm C/C++ Compiler has used SIMD (Single Instruction Multiple Data) instructions and registers to vectorize the code. Notice that each LDP instruction loads array values into a pair of 128-bit wide Q registers. Each vector SUB instruction operates on four array elements at a time, and the code uses two sets of Q registers so that eight array elements are processed in each iteration. Consequently, each loop iteration moves through the array by 32 bytes (2 sets * 4 elements * 4 bytes).
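The shape of the vectorized loop can also be expressed in portable C. The following is a rough sketch under stated assumptions: it emulates the grouping with ordinary scalar code rather than SIMD intrinsics, and the names subtract_arrays_blocked, LANES, and VECTORS_PER_ITER are hypothetical, introduced only for illustration.

```c
#include <stddef.h>

#define ARRAYSIZE 1024
#define LANES 4            /* four 32-bit ints fit in one 128-bit Q register */
#define VECTORS_PER_ITER 2 /* the -O2 loop uses two sets of Q registers */

/* Hypothetical helper that emulates the structure of the -O2 loop:
 * each outer iteration handles 2 groups of 4 elements (8 ints, 32 bytes).
 * The inner loops stand in for the two vector SUB instructions. */
size_t subtract_arrays_blocked(int *restrict a, const int *restrict b,
                               const int *restrict c)
{
    size_t iterations = 0;

    for (size_t i = 0; i < ARRAYSIZE; i += LANES * VECTORS_PER_ITER)
    {
        for (size_t v = 0; v < VECTORS_PER_ITER; v++)
        {
            for (size_t lane = 0; lane < LANES; lane++)
            {
                size_t idx = i + v * LANES + lane;
                a[idx] = b[idx] - c[idx];
            }
        }
        iterations++;
    }
    return iterations; /* ARRAYSIZE / 8 outer iterations */
}
```

With 1024 elements and 8 elements per iteration, the outer loop runs 128 times, compared with 1024 iterations for the non-vectorized loop. This works without a remainder loop only because ARRAYSIZE is a multiple of 8.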
