Example: function in a loop

Sometimes changes to source code are unavoidable if you want to use particular optimization features of the compiler. This can occur when the code is too complex for the compiler to auto-vectorize, or when you want to override the compiler's decisions about how to optimize a particular piece of code.

  1. Create a new file cubed.c containing the following code, which calculates the cubes of an array of values.

    double cubed(double x) {
            return x*x*x;
    }
     
    void vec_cubed(double *x_vec, double *y_vec, int len_vec) {
            int i;
            for (i=0; i<len_vec; i++) {
                    y_vec[i] = cubed(x_vec[i]);
            }
    }
  2. Compile the code, using auto-vectorization:

    armclang --target=aarch64-arm-none-eabi -g -c -O1 -fvectorize cubed.c
  3. Disassemble the resulting object file to see the generated instructions:

    fromelf --disassemble cubed.o -o disassembly.txt

    The disassembled code looks similar to this:

    cubed                  ; Alternate entry point
            FMUL     d1,d0,d0
            FMUL     d0,d1,d0
            RET
    
            AREA ||.text.vec_cubed||, CODE, READONLY, ALIGN=2
    
    vec_cubed                  ; Alternate entry point
            STP      x21,x20,[sp,#-0x20]!
            STP      x19,x30,[sp,#0x10]
            CMP      w2,#1
            B.LT     |L4.48|
            MOV      x19,x1
            MOV      x20,x0
            MOV      w21,w2
    |L4.28|
            LDR      d0,[x20],#8
            BL       cubed
            SUBS     x21,x21,#1
            STR      d0,[x19],#8
            B.NE     |L4.28|
    |L4.48|
            LDP      x19,x30,[sp,#0x10]
            LDP      x21,x20,[sp],#0x20
            RET

    There are a number of issues in this code:

    • The compiler has not performed loop or SLP vectorization, nor has it inlined our cubed function.
    • Because the compiler cannot prove that the two arrays do not overlap, any vectorized version of the loop would need run-time checks on the input pointers.

    These issues can be fixed in a number of ways, such as compiling at a higher optimization level, but let's focus on what code changes can be made without altering the compiler options.
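    For reference, the higher optimization level route is a one-line change, because armclang enables auto-vectorization by default at -O2 and above:

    armclang --target=aarch64-arm-none-eabi -g -c -O2 cubed.c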

  4. Add the following attributes, qualifiers, and pragmas to the code to override some of the compiler's decisions.

    • __attribute__((always_inline)) is an Arm Compiler extension that tells the compiler to always attempt to inline the function. In this example, not only is the function inlined, but the compiler can also perform SLP vectorization.

      Before inlining, the cubed function works with scalar doubles only, so there is neither a need nor a way to perform SLP vectorization on this function by itself.

      When the cubed function is inlined, the compiler can detect that its operations are performed on arrays and vectorize the code with the available ASIMD instructions.

    • restrict is a standard C keyword (since C99) that indicates to the compiler that a given array corresponds to a unique region of memory, which eliminates the need for run-time checks for overlapping arrays. A short caller-side sketch follows this list.
    • #pragma clang loop is a Clang language extension that lets you control auto-vectorization, for example by specifying a vector width (vectorize_width) or an interleave count (interleave_count). These pragmas are a [COMMUNITY] feature of Arm Compiler; a combined example follows the modified code below.

    A complete reference to the vectorization pragmas can be found in the Clang documentation.
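    Note that restrict is a promise from the programmer to the compiler: if restrict-qualified arrays do overlap at run time, the behavior is undefined. A minimal caller-side sketch (the array names here are illustrative only):

    double in[8], out[8];
    vec_cubed(in, out, 8);     /* valid: the arrays are distinct */
    /* vec_cubed(in, in, 8);      undefined: the restrict-qualified
                                  parameters would alias */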

    __attribute__((always_inline)) double cubed(double x) {
            return x*x*x;
    }
     
    void vec_cubed(double *restrict x_vec, double *restrict y_vec, int len_vec) {
            int i;
            #pragma clang loop interleave_count(2)
            for (i=0; i<len_vec; i++) {
                    y_vec[i] = cubed(x_vec[i]);
            }
    }
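    The example above only pins the interleave count. The same pragma family can also request a specific vector width; for illustration (a sketch only; the chosen width must suit the target's registers):

    #pragma clang loop vectorize_width(2) interleave_count(2)
    for (i=0; i<len_vec; i++) {
            y_vec[i] = cubed(x_vec[i]);
    }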
  5. Compile and disassemble with the same commands we used earlier. This produces the following code:

    vec_cubed                  ; Alternate entry point
            CMP      w2,#1
            B.LT     |L4.132|
            CMP      w2,#4
            MOV      w8,w2
            B.CS     |L4.28|
            MOV      x9,xzr
            B        |L4.92|
    |L4.28|
            AND      x9,x8,#0xfffffffc
            ADD      x10,x0,#0x10
            ADD      x11,x1,#0x10
            MOV      x12,x9
    |L4.44|
            LDP      q0,q1,[x10,#-0x10]
            ADD      x10,x10,#0x20
            SUBS     x12,x12,#4
            FMUL     v2.2D,v0.2D,v0.2D
            FMUL     v3.2D,v1.2D,v1.2D
            FMUL     v0.2D,v0.2D,v2.2D
            FMUL     v1.2D,v1.2D,v3.2D
            STP      q0,q1,[x11,#-0x10]
            ADD      x11,x11,#0x20
            B.NE     |L4.44|
            CMP      x9,x8
            B.EQ     |L4.132|
    |L4.92|
            LSL      x11,x9,#3
            ADD      x10,x1,x11
            ADD      x11,x0,x11
            SUB      x8,x8,x9
    |L4.108|
            LDR      d0,[x11],#8
            SUBS     x8,x8,#1
            FMUL     d1,d0,d0
            FMUL     d0,d0,d1
            STR      d0,[x10],#8
            B.NE     |L4.108|
    |L4.132|
            RET

    This disassembly shows that the inlining, SLP vectorization, and loop vectorization have been successful. Using the restrict pointers has eliminated run-time overlap checks.

    The code size has increased slightly because of the loop tail, which handles the remaining iterations whenever the total loop count is not a multiple of the effective unroll depth. Here the interleave count is two and the SLP width is two, so the effective unroll depth is four: with len_vec = 10, for example, the vector loop processes eight elements and the scalar tail handles the remaining two. In the next step, we look at an optimization we can make if we know the loop count will always be a multiple of four.

  6. Let us assume that our loop count will always be a multiple of four. We can communicate this to the compiler by masking off the two lowest bits of the loop bound:

    void vec_cubed(double *restrict x_vec, double *restrict y_vec, int len_vec) {
            int i;
            #pragma clang loop interleave_count(1)
            for (i=0; i<(len_vec & ~3); i++) {
                    y_vec[i] = cubed(x_vec[i]);
            }
    }
  7. Compile and disassemble with the same commands we used earlier. This produces the following code:

    vec_cubed                  ; Alternate entry point
            AND      w8,w2,#0xfffffffc
            CMP      w8,#1
            B.LT     |L13.40|
            MOV      w8,w8
    |L13.16|
            LDR      q0,[x0],#0x10
            SUBS     x8,x8,#2
            FMUL     v1.2D,v0.2D,v0.2D
            FMUL     v0.2D,v0.2D,v1.2D
            STR      q0,[x1],#0x10
            B.NE     |L13.16|
    |L13.40|
            RET

    The code size is reduced because the compiler no longer has to test for, and deal with, any remaining iterations when the count is not a multiple of four. Promising the compiler that the amount of data we supply will always be a multiple of the vector length has produced more compact code.
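    As an alternative to changing the loop bound, Clang-based compilers such as armclang also provide the __builtin_assume intrinsic, which can express the same guarantee directly. This is a sketch only; whether the vectorizer exploits the assumption can vary between compiler versions:

    void vec_cubed(double *restrict x_vec, double *restrict y_vec, int len_vec) {
            int i;
            __builtin_assume((len_vec & 3) == 0);  /* promise: len_vec is a multiple of 4 */
            for (i=0; i<len_vec; i++) {
                    y_vec[i] = cubed(x_vec[i]);
            }
    }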

This example is simple enough that compiling at -O2 will perform all of these optimizations with no code changes, but more complex pieces of code might require this type of tuning to get the most from the compiler.

A full code listing is included below. You can compile and disassemble at a variety of optimization levels and unroll depths to observe the compiler's auto-vectorization behavior.

  • Full source code example: function in a loop
    /*
     * Copyright (C) Arm Limited, 2019 All rights reserved. 
     * 
     * The example code is provided to you as an aid to learning when working 
     * with Arm-based technology, including but not limited to programming tutorials. 
     * Arm hereby grants to you, subject to the terms and conditions of this Licence, 
     * a non-exclusive, non-transferable, non-sub-licensable, free-of-charge licence, 
     * to use and copy the Software solely for the purpose of demonstration and 
     * evaluation.
     * 
     * You accept that the Software has not been tested by Arm therefore the Software 
     * is provided "as is", without warranty of any kind, express or implied. In no 
     * event shall the authors or copyright holders be liable for any claim, damages 
     * or other liability, whether in action or contract, tort or otherwise, arising 
     * from, out of or in connection with the Software or the use of Software.
     */
    
    #include <stdio.h>
     
    void vec_init(double *vec, int len_vec, double init_val) {
            int i;
            for (i=0; i<len_vec; i++) {
                    vec[i] = init_val*i - len_vec/2;
            }
    }
     
    void vec_print(double *vec, int len_vec) {
            int i;
            for (i=0; i<len_vec; i++) {
                    printf("%f, ", vec[i]);
            }
            printf("\n");
    }
     
    double cubed(double x) {
            return x*x*x;
    }
     
    void vec_cubed(double *x_vec, double *y_vec, int len_vec) {
            int i;
            for (i=0; i<len_vec; i++) {
                    y_vec[i] = cubed(x_vec[i]);
            }
    }
     
    __attribute__((always_inline)) double cubed_i(double x) {
            return x*x*x;
    }
     
    void vec_cubed_opt(double *restrict x_vec, double *restrict y_vec, int len_vec) {
            int i;
            #pragma clang loop interleave_count(1)
            for (i=0; i<len_vec; i++) {
                    y_vec[i] = cubed_i(x_vec[i]);
            }
    }
     
     
    int main() {
            int N = 10;
            double X[N];
            double Y[N];
     
            vec_init(X, N, 1);
            vec_print(X, N);
            vec_cubed(X, Y, N);
            vec_print(Y, N);
            vec_cubed_opt(X, Y, N);
            vec_print(Y, N);
            return 0;
    }
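    You can rebuild and inspect the full listing with the same tools used throughout this example, varying the optimization level to compare the generated code. For example (the file name cubed_full.c is arbitrary):

    armclang --target=aarch64-arm-none-eabi -g -c -O2 cubed_full.c
    fromelf --disassemble cubed_full.o -o disassembly_full.txt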