Example: vector addition

Let's look at how we can use compiler options to auto-vectorize and optimize a simple C program.

  1. Create a new file vec_add.c containing the following function. This function adds two arrays of 32-bit floating-point values.

    void vec_add(float *vec_A, float *vec_B, float *vec_C, int len_vec) {
            int i;
            for (i=0; i<len_vec; i++) {
                    vec_C[i] = vec_A[i] + vec_B[i];
            }
    }

  2. Compile the code, without using auto-vectorization:

    armclang --target=aarch64-arm-none-eabi -g -c -O1 vec_add.c

  3. Disassemble the resulting object file to see the generated instructions:

    fromelf --disassemble vec_add.o -o disassembly_vec_off.txt

    The disassembled code looks similar to this:

    vec_add                  ; Alternate entry point
            CMP      w3,#1
            B.LT     |L3.36|
            MOV      w8,w3
    |L3.12|
            LDR      s0,[x0],#4
            LDR      s1,[x1],#4
            SUBS     x8,x8,#1
            FADD     s0,s0,s1
            STR      s0,[x2],#4
            B.NE     |L3.12|
    |L3.36|
            RET
    

    Here we can see the label name vec_add for the function, followed by the generated assembly instructions that make up the function. The FADD instruction performs the addition, but the code does not use Neon: each loop iteration performs only one addition. We can see this because the FADD instruction operates on the scalar registers S0 and S1.

  4. Re-compile the code, this time using auto-vectorization:

    armclang --target=aarch64-arm-none-eabi -g -c -O1 vec_add.c -fvectorize

  5. Disassemble the resulting object file to see the generated instructions:

    fromelf --disassemble vec_add.o -o disassembly_vec_on.txt

    The disassembled code looks similar to this:

    vec_add                  ; Alternate entry point
            CMP      w3,#1
            B.LT     |L3.184|
            CMP      w3,#4
            MOV      w8,w3
            MOV      x9,xzr
            B.CC     |L3.140|
            LSL      x10,x8,#2
            ADD      x12,x0,x10
            ADD      x11,x2,x10
            CMP      x12,x2
            ADD      x10,x1,x10
            CSET     w12,HI
            CMP      x11,x0
            CSET     w13,HI
            CMP      x10,x2
            CSET     w10,HI
            CMP      x11,x1
            AND      w12,w12,w13
            CSET     w11,HI
            TBNZ     w12,#0,|L3.140|
            AND      w10,w10,w11
            TBNZ     w10,#0,|L3.140|
            AND      x9,x8,#0xfffffffc
            MOV      x10,x9
            MOV      x11,x2
            MOV      x12,x1
            MOV      x13,x0
    |L3.108|
            LDR      q0,[x13],#0x10
            LDR      q1,[x12],#0x10
            SUBS     x10,x10,#4
            FADD     v0.4S,v0.4S,v1.4S
            STR      q0,[x11],#0x10
            B.NE     |L3.108|
            CMP      x9,x8
            B.EQ     |L3.184|
    |L3.140|
            LSL      x12,x9,#2
            ADD      x10,x2,x12
            ADD      x11,x1,x12
            ADD      x12,x0,x12
            SUB      x8,x8,x9
    |L3.160|
            LDR      s0,[x12],#4
            LDR      s1,[x11],#4
            SUBS     x8,x8,#1
            FADD     s0,s0,s1
            STR      s0,[x10],#4
            B.NE     |L3.160|
    |L3.184|
            RET

    Auto-vectorization has been successful, as we can see from the instruction FADD v0.4S,v0.4S,v1.4S, which adds four 32-bit floats packed into each SIMD register. However, this has come at a significant cost to code size. The compiler must emit run-time checks that the output array does not overlap either input array, and it must keep a scalar loop for arrays shorter than four elements and for any elements left over when the array length is not a multiple of four. Such increases in code size may or may not be acceptable depending on the project and target hardware. They may be tolerable for a phone application, where the change in code size is insignificant compared with the available memory, but could be unacceptable for an embedded application with a small amount of memory.

  6. A complete code listing is included below. Compile and disassemble at different optimization levels to see the effect on the generated code.

  • Full source code example: vector addition
    /*
     * Copyright (C) Arm Limited, 2019 All rights reserved. 
     * 
     * The example code is provided to you as an aid to learning when working 
     * with Arm-based technology, including but not limited to programming tutorials. 
     * Arm hereby grants to you, subject to the terms and conditions of this Licence, 
     * a non-exclusive, non-transferable, non-sub-licensable, free-of-charge licence, 
     * to use and copy the Software solely for the purpose of demonstration and 
     * evaluation.
     * 
     * You accept that the Software has not been tested by Arm therefore the Software 
     * is provided "as is", without warranty of any kind, express or implied. In no 
     * event shall the authors or copyright holders be liable for any claim, damages 
     * or other liability, whether in action or contract, tort or otherwise, arising 
     * from, out of or in connection with the Software or the use of Software.
     */
    #include <stdio.h>
    
    void vec_init(float *vec, int len_vec, float init_val) {
            int i;
            for (i=0; i<len_vec; i++) {
                    vec[i] = init_val;
            }
    }
    
    void vec_print(float *vec, int len_vec) {
            int i;
            for (i=0; i<len_vec; i++) {
                    printf("%f, ", vec[i]);
            }
            printf("\n");
    }
    
    void vec_add(float *vec_A, float *vec_B, float *vec_C, int len_vec) {
            int i;
            for (i=0; i<len_vec; i++) {
                    vec_C[i] = vec_A[i] + vec_B[i];
            }
    }
    
    
    int main() {
            int N = 10;
            float A[N];
            float B[N];
            float C[N];
    
            vec_init(A, N, 1.0);
            vec_init(B, N, 1.0);
    
            vec_add(A, B, C, N);
    
            vec_print(A, N);
            vec_print(B, N);
            vec_print(C, N);
            return 0;
    }
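
The vectorized disassembly in step 5 can be hard to follow because it contains three parts: run-time overlap checks, a loop that handles four floats per iteration, and a scalar loop for the remaining elements. The following is a rough C model of that control flow, intended only as a reading aid. The names vec_add_model and ranges_overlap are hypothetical and do not appear in the original example, and the code is a sketch of the structure rather than the compiler's actual output.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero when the n-float ranges starting at a and b
 * overlap in memory. Models the CMP/CSET HI/AND sequences in the listing. */
static int ranges_overlap(const float *a, const float *b, size_t n) {
    uintptr_t a0 = (uintptr_t)a;
    uintptr_t b0 = (uintptr_t)b;
    uintptr_t bytes = (uintptr_t)(n * sizeof(float));
    return (a0 + bytes > b0) && (b0 + bytes > a0);
}

/* Hypothetical scalar model of the structure of the vectorized vec_add. */
void vec_add_model(float *vec_A, float *vec_B, float *vec_C, int len_vec) {
    size_t n, i = 0;

    if (len_vec < 1) {                  /* CMP w3,#1 ; B.LT to RET */
        return;
    }
    n = (size_t)len_vec;

    /* Take the 4-wide path only for at least four elements and only when
     * the output range overlaps neither input range. */
    if (n >= 4 && !ranges_overlap(vec_A, vec_C, n)
               && !ranges_overlap(vec_B, vec_C, n)) {
        size_t n_vec = n & ~(size_t)3;  /* AND x9,x8,#0xfffffffc */
        for (; i < n_vec; i += 4) {
            float a[4], b[4];
            size_t k;
            /* Stands in for LDR q0 / LDR q1: load four floats from each input. */
            for (k = 0; k < 4; k++) {
                a[k] = vec_A[i + k];
                b[k] = vec_B[i + k];
            }
            /* Stands in for FADD v0.4S,v0.4S,v1.4S followed by STR q0. */
            for (k = 0; k < 4; k++) {
                vec_C[i + k] = a[k] + b[k];
            }
        }
    }

    /* Scalar loop: leftover elements, or every element if a check failed. */
    for (; i < n; i++) {
        vec_C[i] = vec_A[i] + vec_B[i];
    }
}
```

With len_vec set to 10, as in main above, this model would process elements 0 to 7 in two four-wide iterations and elements 8 and 9 in the scalar loop. The extra branches and the second loop correspond to the additional instructions seen in the vectorized listing.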