
Coding for SVE vs Neon

This topic summarizes the important differences between coding for the Scalable Vector Extension (SVE) and coding for Neon. For users who have already ported their applications to Arm®v8-A Neon hardware, it also highlights the key differences to consider when porting them to SVE.

Arm Neon technology is the Advanced SIMD (Single Instruction Multiple Data) feature for the Armv8-A architecture profile. Neon is a feature of the Instruction Set Architecture, providing instructions that can perform mathematical operations in parallel on multiple data streams.

SVE is the next-generation SIMD extension of the Armv8-A instruction set. It is not an extension of Neon, but a new set of vector instructions developed to target HPC workloads. In short, SVE enables vectorization of loops which would be impossible, or not beneficial, to vectorize with Neon. Importantly, and unlike other SIMD architectures, SVE is Vector Length Agnostic (VLA): it does not fix the size of the vector registers, but instead leaves hardware implementors free to choose the size best suited to the intended workloads.

Data processing methodologies: SISD and SIMD

Most Arm instructions are Single Instruction Single Data (SISD). Each instruction performs one operation and writes to one output data stream. Processing multiple items requires multiple instructions.

For example, in traditional SISD instruction sets, to perform four separate addition operations requires four instructions to add values from four pairs of registers:

ADD x0, x0, x5
ADD x1, x1, x6
ADD x2, x2, x7
ADD x3, x3, x8

Single Instruction Multiple Data (SIMD) instructions perform the same operation simultaneously for multiple items. These items are packed as separate elements in a larger register.

For example, the following instruction adds four pairs of 32-bit values together. However, in this case, the values are packed as separate lanes in one pair of 128-bit registers. Each lane in the first source register is then added to the corresponding lane in the second source register, before being stored in the destination register:

ADD V8.4S, V8.4S, V9.4S

Performing the four operations with a single SIMD instruction is more efficient than with four separate SISD instructions.
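The same four-lane addition can also be written in C with a single Neon intrinsic (intrinsics are covered in more detail later in this topic). The following minimal sketch is for illustration only; the function name is arbitrary:

#include <arm_neon.h>

// Add four pairs of 32-bit integers with one SIMD operation.
// The compiler typically lowers this to a single ADD Vd.4S, Vn.4S, Vm.4S.
int32x4_t add_four_lanes(int32x4_t a, int32x4_t b) {
    return vaddq_s32(a, b);
}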

Fundamentals: Instruction sets

AArch64 is the name that is used to describe the 64-bit Execution state of the Armv8 architecture. In AArch64 state, the processor executes the A64 Instruction Set, which contains Neon instructions (also referred to as Advanced SIMD instructions). The SVE extension is introduced in version Armv8.2-A of the architecture, and adds a new subset of instructions to the existing Armv8-A A64 Instruction Set.

Summary of the Instruction Set extensions

Extension: Neon

Key features: Provides additional instructions that can perform mathematical operations in parallel on multiple data streams. Also adds support for double-precision floating-point, enabling C code that uses double precision.

Categorization of new instructions:

  • Promotion/Demotion

  • Pair-wise operations

  • Load and store operations

  • Logical operators

  • Multiplication operation

Extension: SVE

Key features: SVE adds:

  • Support for wide vector and predicate registers (resulting in two main classes of instructions: predicated and unpredicated).

  • A set of instructions that operate on wide vectors.

  • Some minor additions to the configuration and identification registers.

Categorization of new instructions:

  • Load, store, and prefetch instructions.

  • Integer operations.

  • Vector address calculation.

  • Bitwise operations.

  • Floating-point operations.

  • Predicate operations.

  • Move operations.

  • Reduction operations.

For descriptions of each, see What is the Scalable Vector Extension?.

For more information about the Neon instruction set, see the A64 Instruction set for Armv8-A. For more information about the SVE instruction set extension, see ARM Architecture Reference Manual Supplement - The Scalable Vector Extension (SVE), for ARMv8-A.

Fundamentals: Registers, vectors, lanes, and elements

Neon units operate on a separate register file of 128-bit registers and are fully integrated into Armv8-A processors. Neon units use a simple programming model because they use the same address space as an application.

The Neon register file is a collection of registers which can be accessed as 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit registers.

The Neon registers contain vectors of some consistent data type. A vector is divided into lanes and each lane contains a data value that is called an element.

The number of lanes in a Neon vector depends on the size of the vector and the data elements in the vector. For example, a 128-bit Neon vector can contain the following element sizes:

  • Sixteen 8-bit elements

  • Eight 16-bit elements

  • Four 32-bit elements

  • Two 64-bit elements

However, Neon instructions always operate on 64-bit or 128-bit vectors.

In SVE, the instruction set operates on a new set of vector and predicate registers: 32 Z registers, 16 P registers, and one First Faulting Register (FFR):

  • The Z registers are data registers. The width of a Z register is an implementation-defined multiple of 128 bits, up to an architectural maximum of 2048 bits. Data in these registers can be interpreted as 8-bit, 16-bit, 32-bit, 64-bit, or 128-bit elements. The low 128 bits of each Z register overlap the corresponding Neon register, and therefore also the scalar floating-point registers.

  • In every processor implementation, the P registers hold one bit for each byte available in a Z register. In other words, a P register is always 1/8th the size of the Z register width.

  • The FFR register is a special predicate register that certain instructions can use implicitly. Both P registers and the FFR register are unique to SVE.

Fundamentals: Vector Length Agnostic (VLA) programming

SVE introduces the concept of Vector Length Agnostic (VLA) programming.

Unlike traditional SIMD architectures, which define a fixed size for their vector registers, SVE only specifies a maximum size. This freedom of choice enables different Arm architectural licensees to develop their own implementation, targeting specific workloads and technologies which could benefit from a particular vector length.

A key goal of SVE is to allow the same program image to be run on any implementation of the architecture, so it includes instructions which permit vector code to adapt automatically to the current vector length at runtime.
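As a brief taster of what VLA code looks like, the following sketch scales an array of floats using SVE intrinsics (which are discussed later in this topic). The function name and parameters are illustrative; the point is that the loop processes as many elements per iteration as the hardware vector length allows, so the same binary runs unchanged on any implementation:

#include <arm_sve.h>

// Multiply each element of x by a scalar, for any vector length.
void scale_f32(float *x, float a, uint64_t n) {
    for (uint64_t i = 0; i < n; i += svcntw()) {     // svcntw(): number of 32-bit lanes per vector
        svbool_t pg = svwhilelt_b32_u64(i, n);       // predicate limits work to the valid elements
        svfloat32_t vx = svld1_f32(pg, &x[i]);       // predicated load
        vx = svmul_n_f32_x(pg, vx, a);               // multiply by the scalar
        svst1_f32(pg, &x[i], vx);                    // predicated store
    }
}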

More information about VLA programming is provided in a later chapter; see SVE Vector Length Agnostic programming.

Coding best practices

As a programmer, there are a number of ways you can make use of Neon and SVE technology:

  • Neon and SVE-enabled math libraries, such as Arm Performance Libraries.

    Note

    The SVE-enabled library is available in Arm Compiler for Linux version 19.3 and later.

  • Auto-vectorization features in your compiler can automatically optimize your code to take advantage of Neon and SVE.

  • Intrinsics are function calls that the compiler replaces with appropriate Neon or SVE instructions. This gives you direct access to the exact Neon or SVE instructions you want. For a searchable index of Neon intrinsics, see Neon intrinsics. The SVE intrinsics are defined in the Arm C Language Extensions for SVE specification.

  • For very high performance, hand-coded Neon or SVE assembly code can be an alternative approach for experienced programmers.

Coding best practices: Compiler optimization

The Arm Compiler for Linux can automatically generate code that contains Armv8 Neon and SVE instructions. Allowing the compiler to automatically identify opportunities in your code to use Neon or SVE instructions is called auto-vectorization.

In terms of specific compilation techniques, auto-vectorization includes:

  • Loop vectorization: unrolling loops to reduce the number of iterations, while performing more operations in each iteration.

  • Superword-Level Parallelism (SLP) vectorization: bundling scalar operations together to make use of full width Advanced SIMD instructions.
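For example, a simple loop like the following is a typical candidate for auto-vectorization; the function and array names are purely illustrative. The restrict qualifiers tell the compiler that the arrays do not overlap, which removes a common obstacle to vectorization:

// The compiler can vectorize this loop at -O2 and above, using Neon
// instructions by default, or SVE instructions with -march=armv8-a+sve.
void add_arrays(float *restrict a, const float *restrict b,
                const float *restrict c, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];    // independent iterations: safe to vectorize
    }
}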

The benefits of relying on compiler auto-vectorization are:

  • Programs implemented in high level languages are portable, so long as there are no architecture-specific code elements such as inline assembly or intrinsics.

  • Modern compilers are capable of performing advanced optimizations automatically.

  • Targeting a given micro-architecture can be as easy as setting a single compiler option. However, hand-optimizing a program in assembly requires deep knowledge of the target hardware.

To compile for AArch64 with Arm Compiler for Linux, see the following quick reference table:

Auto-vectorization with Arm Compiler for Linux

Extension

Header file form

Recommended Arm Compiler for Linux command line

Notes

Neon

#include <arm_neon.h>

armclang -O<level> -mcpu={native|<target>} -o <binary_name> <filename>.c

To take advantage of micro-architectural optimizations, set -mcpu to the target processor your application will run on. If the target processor is the same processor you are compiling your code on, you can set -mcpu=native for the compiler to automatically detect which processor this is.

-march=armv8-a is also supported, but would not include the micro-architectural optimizations. For more information about architectural compiler flags and data to support this recommendation, see the Compiler flags across architectures: -march, -mtune, and -mcpu blog.

SVE

#ifdef __ARM_FEATURE_SVE
#include <arm_sve.h>
#endif /* __ARM_FEATURE_SVE */

armclang -O<level> -march=armv8-a+sve -o <binary_name> <filename>.c

-march=armv8-a+sve ensures the compiler generates SVE code for Armv8-A hardware. If you do not have SVE-enabled hardware, you can use ArmIE to emulate the SVE instructions.

When SVE-enabled hardware is available and you are compiling on that target SVE hardware, Arm recommends using -mcpu=native instead, so that micro-architectural optimizations can be taken advantage of.

Supported optimization levels for -O<level> for both Neon and SVE code include:

Supported Arm Compiler for Linux optimization options for auto-vectorization

Option

Description

Auto-vectorization

-O0

Minimum optimization for the performance of the compiled binary. Turns off most optimizations. When debugging is enabled, this option generates code that directly corresponds to the source code. Therefore, this might result in a significantly larger image. This is the default optimization level.

Never

-O1

Restricted optimization. When debugging is enabled, this option gives the best debug view for the trade-off between image size, performance, and debug.

Disabled by default.

-O2

High optimization. When debugging is enabled, the debug view might be less satisfactory because the mapping of object code to source code is not always clear. The compiler might perform optimizations that cannot be described by debug information.

Enabled by default.

-O3

Very high optimization. When debugging is enabled, this option typically gives a poor debug view. Arm recommends debugging at lower optimization levels.

Enabled by default.

-Ofast

Enable all the optimizations from level 3, including those performed with the -ffp-mode=fast armclang option. This level also performs other aggressive optimizations that might violate strict compliance with language standards.

Enabled by default.

Note

  • Auto-vectorization is enabled by default at optimization level -O2 and higher. The -fno-vectorize option lets you disable auto-vectorization.

  • At optimization level -O0, auto-vectorization is always disabled. If you specify the -fvectorize option, the compiler ignores it.

  • At optimization level -O1, auto-vectorization is disabled by default. The -fvectorize option lets you enable auto-vectorization.

As an implementation becomes more complicated, the likelihood that the compiler can auto-vectorize the code decreases. For example, loops with the following characteristics are particularly difficult (or impossible) to vectorize:

  • Loops with interdependencies between different loop iterations.

  • Loops with break clauses.

  • Loops with complex conditions.

Neon and SVE have different requirements when it comes to the conditions for auto-vectorization. For example, a necessary condition for auto-vectorizing Neon code is that the number of loop iterations must be known at the start of the loop, at execution time. However, knowing the number of loop iterations is not required to auto-vectorize SVE code.

Note

Break conditions mean the loop size might not be knowable at the start of the loop, which prevents auto-vectorization for Neon code. If it is not possible to completely avoid a break condition, it might be worthwhile breaking up the loops into multiple vectorizable and non-vectorizable parts.
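As an illustration of that advice, a search-and-accumulate loop containing a break might be split as follows. This is a hedged sketch; the function name and data are arbitrary:

// Summing until the first negative value. Written as a single loop with a
// break, the trip count is unknown, so Neon auto-vectorization fails.
// Splitting it leaves the second loop with a known bound.
int sum_until_negative(const int *x, int n) {
    int i;

    // Part 1: find where to stop (scalar; contains the break).
    for (i = 0; i < n; i++) {
        if (x[i] < 0) {
            break;
        }
    }
    int limit = i;

    // Part 2: a counted loop that the compiler can vectorize.
    int sum = 0;
    for (i = 0; i < limit; i++) {
        sum += x[i];
    }
    return sum;
}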

A full discussion of the compiler directives used to control vectorization of loops can be found in the LLVM-Clang documentation, but the two most important are:

  • #pragma clang loop vectorize(enable)

  • #pragma clang loop interleave(enable)

These pragmas hint to the compiler to perform loop vectorization and interleaving, respectively. More detailed guides covering auto-vectorization are available in the Arm C/C++ Compiler and Arm Fortran Compiler reference guides.
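For example, the hints can be placed directly above a loop; whether the compiler acts on them still depends on its legality and cost analysis. The function here is only illustrative:

void saxpy(float *restrict y, const float *restrict x, float a, int n) {
    #pragma clang loop vectorize(enable) interleave(enable)
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}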

Coding best practices: Intrinsics

Intrinsics are functions whose precise implementation is known to a compiler. These functions let you use Neon or SVE without having to write assembly code because the functions themselves contain short assembly kernels, which are inlined into the calling code. In addition, register allocation and pipeline optimization are handled by the compiler, avoiding many of the difficulties often seen when developing assembly code.

Using intrinsics has several benefits:

  • Powerful: Intrinsics give the developer direct access to the Neon and SVE instruction sets, without the need for hand-written assembly code.

  • Portable: Hand-written Neon or SVE assembly instructions might need to be rewritten for different target processors. C and C++ code containing Neon intrinsics can be compiled for a new AArch64 target or a new Execution state with minimal or no code changes. However, C and C++ code containing SVE intrinsics will only run on SVE-enabled hardware.

  • Flexible: The developer can exploit Neon when needed, or use C/C++ when it is not, while avoiding many low-level engineering concerns.

However, intrinsics might not be the right choice in all situations:

  • More learning is required to use intrinsics than to import a library or to rely on a compiler.

  • Hand-optimized assembly code might offer the greatest scope for performance improvement, even if it is more difficult to write.

For a list of all the Neon intrinsics, see the Neon intrinsics. The Neon intrinsics engineering specification is contained in the Arm C Language Extensions (ACLE).

The SVE intrinsics engineering specification is contained in the Arm C Language Extensions for SVE specification.

Example 1: Simple matrix multiplication with intrinsics

This example implements some C functions using Neon intrinsics and using SVE intrinsics. The example chosen does not demonstrate the full complexity of the application, but illustrates the use of intrinsics, and is a starting point for more complex code.

Matrix multiplication is an operation performed in many data intensive applications and consists of groups of arithmetic operations which are repeated in a simple way:

Matrix multiplication diagram

The matrix multiplication process is as follows:

  1. Take a row in the first matrix - 'A'

  2. Perform a dot product of this row with a column from the second matrix - 'B'

  3. Store the result in the corresponding row and column of a new matrix - 'C'

For matrices of 32-bit floats, the multiplication could be written as:

void matrix_multiply_c(float32_t *A, float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    for (int i_idx=0; i_idx < n; i_idx++) {
        for (int j_idx=0; j_idx < m; j_idx++) {
            C[n*j_idx + i_idx] = 0;
            for (int k_idx=0; k_idx < k; k_idx++) {
                C[n*j_idx + i_idx] += A[n*k_idx + i_idx]*B[k*j_idx + k_idx];
            }
        }
    }
}

Assume a column-major layout of the matrices in memory. That is, an n x m matrix M is represented as an array M_array, where the element of M at row i and column j is M_array[n*j + i].

This code is sub-optimal, because it does not make full use of Neon. Intrinsics can be used to improve it.

The following code uses intrinsics to multiply two 4x4 matrices. The loops can be completely unrolled because there is a small, fixed number of values to process, all of which can fit into the Neon registers of the processor at the same time.

void matrix_multiply_4x4_neon(const float32_t *A, const float32_t *B, float32_t *C) {
    // these are the columns A
    float32x4_t A0;
    float32x4_t A1;
    float32x4_t A2;
    float32x4_t A3;

    // these are the columns B
    float32x4_t B0;
    float32x4_t B1;
    float32x4_t B2;
    float32x4_t B3;

    // these are the columns C
    float32x4_t C0;
    float32x4_t C1;
    float32x4_t C2;
    float32x4_t C3;

    A0 = vld1q_f32(A);
    A1 = vld1q_f32(A+4);
    A2 = vld1q_f32(A+8);
    A3 = vld1q_f32(A+12);

    // Zero accumulators for C values
    C0 = vmovq_n_f32(0);
    C1 = vmovq_n_f32(0);
    C2 = vmovq_n_f32(0);
    C3 = vmovq_n_f32(0);

    // Multiply accumulate in 4x1 blocks, that is each column in C
    B0 = vld1q_f32(B);
    C0 = vfmaq_laneq_f32(C0, A0, B0, 0);
    C0 = vfmaq_laneq_f32(C0, A1, B0, 1);
    C0 = vfmaq_laneq_f32(C0, A2, B0, 2);
    C0 = vfmaq_laneq_f32(C0, A3, B0, 3);
    vst1q_f32(C, C0);

    B1 = vld1q_f32(B+4);
    C1 = vfmaq_laneq_f32(C1, A0, B1, 0);
    C1 = vfmaq_laneq_f32(C1, A1, B1, 1);
    C1 = vfmaq_laneq_f32(C1, A2, B1, 2);
    C1 = vfmaq_laneq_f32(C1, A3, B1, 3);
    vst1q_f32(C+4, C1);

    B2 = vld1q_f32(B+8);
    C2 = vfmaq_laneq_f32(C2, A0, B2, 0);
    C2 = vfmaq_laneq_f32(C2, A1, B2, 1);
    C2 = vfmaq_laneq_f32(C2, A2, B2, 2);
    C2 = vfmaq_laneq_f32(C2, A3, B2, 3);
    vst1q_f32(C+8, C2);

    B3 = vld1q_f32(B+12);
    C3 = vfmaq_laneq_f32(C3, A0, B3, 0);
    C3 = vfmaq_laneq_f32(C3, A1, B3, 1);
    C3 = vfmaq_laneq_f32(C3, A2, B3, 2);
    C3 = vfmaq_laneq_f32(C3, A3, B3, 3);
    vst1q_f32(C+12, C3);
}

Fixed-size 4x4 matrices are chosen because:

  • Some applications need 4x4 matrices specifically, for example: graphics or relativistic physics.

  • The Neon vector registers hold four 32-bit values. Matching the program to the architecture makes it easier to optimize.

  • This 4x4 kernel can be used in a more general kernel.

Summarizing the Neon intrinsics that have been used here:

Summary of Neon intrinsics used

Code element

What is it?

Why are they used?

float32x4_t

An array of four 32-bit floats.

One float32x4_t fits into a 128-bit register and ensures that there are no wasted register bits, even in C code.

vld1q_f32(...)

A function which loads four 32-bit floats into float32x4_t.

To get the matrix values needed from A and B.

vfmaq_laneq_f32(...)

A function which uses the fused multiply accumulate instruction. Multiplies a float32x4_t value by a single element of another float32x4_t then adds the result to a third float32x4_t before returning the result.

Since the matrix row-on-column dot products are a set of multiplications and additions, this operation fits naturally.

vst1q_f32(...)

A function which stores float32x4_t at a given address.

To store the results after they are calculated.
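Before moving on to SVE, a short usage sketch might help. It assumes the matrix_multiply_4x4_neon kernel above is defined in the same source file; the data values are arbitrary:

#include <arm_neon.h>
#include <stdio.h>

int main(void) {
    // Column-major 4x4 matrices: A is the identity, so C should equal B.
    float32_t A[16] = {1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 1, 0,  0, 0, 0, 1};
    float32_t B[16];
    float32_t C[16];

    for (int i = 0; i < 16; i++) {
        B[i] = (float32_t)i;
    }

    matrix_multiply_4x4_neon(A, B, C);

    for (int i = 0; i < 16; i++) {
        printf("%.1f%c", C[i], (i % 4 == 3) ? '\n' : ' ');
    }
    return 0;
}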

Optimizing a similar case for SVE gives the following code:

void matrix_multiply_nx4_sve(const float32_t *A, const float32_t *B, float32_t *C, uint32_t n) {
    // these are the columns A
    svfloat32_t A0;
    svfloat32_t A1;
    svfloat32_t A2;
    svfloat32_t A3;

    // these are the columns B
    svfloat32_t B0;
    svfloat32_t B1;
    svfloat32_t B2;
    svfloat32_t B3;

    // these are the columns C
    svfloat32_t C0;
    svfloat32_t C1;
    svfloat32_t C2;
    svfloat32_t C3;

    svbool_t pred = svwhilelt_b32_u32(0, n);
    A0 = svld1_f32(pred, A);
    A1 = svld1_f32(pred, A+n);
    A2 = svld1_f32(pred, A+2*n);
    A3 = svld1_f32(pred, A+3*n);

    // Zero accumulators for C values
    C0 = svdup_n_f32(0);
    C1 = svdup_n_f32(0);
    C2 = svdup_n_f32(0);
    C3 = svdup_n_f32(0);

    // Multiply accumulate in 4x1 blocks, that is each column in C
    B0 = svld1rq_f32(svptrue_b32(), B);
    C0 = svmla_lane_f32(C0, A0, B0, 0);
    C0 = svmla_lane_f32(C0, A1, B0, 1);
    C0 = svmla_lane_f32(C0, A2, B0, 2);
    C0 = svmla_lane_f32(C0, A3, B0, 3);
    svst1_f32(pred, C, C0);

    B1 = svld1rq_f32(svptrue_b32(), B+4);
    C1 = svmla_lane_f32(C1, A0, B1, 0);
    C1 = svmla_lane_f32(C1, A1, B1, 1);
    C1 = svmla_lane_f32(C1, A2, B1, 2);
    C1 = svmla_lane_f32(C1, A3, B1, 3);
    svst1_f32(pred, C+n, C1);

    B2 = svld1rq_f32(svptrue_b32(), B+8);
    C2 = svmla_lane_f32(C2, A0, B2, 0);
    C2 = svmla_lane_f32(C2, A1, B2, 1);
    C2 = svmla_lane_f32(C2, A2, B2, 2);
    C2 = svmla_lane_f32(C2, A3, B2, 3);
    svst1_f32(pred, C+2*n, C2);

    B3 = svld1rq_f32(svptrue_b32(), B+12);
    C3 = svmla_lane_f32(C3, A0, B3, 0);
    C3 = svmla_lane_f32(C3, A1, B3, 1);
    C3 = svmla_lane_f32(C3, A2, B3, 2);
    C3 = svmla_lane_f32(C3, A3, B3, 3);
    svst1_f32(pred, C+3*n, C3);
}

Summarizing the SVE intrinsics that have been used here:

Summary of SVE intrinsics used

Code element

What is it?

Why are they used?

svfloat32_t

An array of 32-bit floats, where the exact number is defined at runtime based on the SVE vector length.

svfloat32_t enables you to use SVE vectors and predicates directly, without relying on the compiler for autovectorization.

svwhilelt_b32_u32(...)

A function which computes a predicate from two uint32_t integers.

When loading from A and storing to C, svwhilelt_b32_u32(...) ensures you do not read or write past the end of each column.

svld1_f32(...)

A function which loads 32-bit floats into an svfloat32_t SVE vector.

To get the matrix values needed from A. This also takes a predicate to make sure we do not load off the end of the matrix (inactive elements are set to zero).

svptrue_b32(...)

A function which sets a predicate for 32-bit values to all-true.

When loading from B, svptrue_b32(...) ensures the vector fills completely because the precondition of calling this function is that the matrix has a dimension which is a multiple of four.

svld1rq_f32(...)

A function which loads an SVE vector with copies of the same 128-bits (four 32-bit values).

To get the matrix values needed from B. Only loads four replicated values because the svmla_lane_f32 instruction only indexes in 128-bit segments.

svmla_lane_f32(...)

A function which uses the fused multiply accumulate instruction. The function multiplies each 128-bit segment of an svfloat32_t value by the corresponding single element of each 128-bit segment of another svfloat32_t. The svmla_lane_f32(...) function then adds the result to a third svfloat32_t before returning the result.

This operation naturally fits the row-on-column dot products because they are a set of multiplications and additions.

svst1_f32(...)

A function which stores svfloat32_t at a given address.

To store the results after they are calculated. The predicate ensures we do not store results past the end of each column.

The important difference here is the ability to ignore one of the dimensions of the matrix because of the variable-length vectors in SVE. Instead, you can explicitly pass the length of the n dimension, and use predication to ensure it is not exceeded.

Example 2: Large matrix multiplication with intrinsics

To multiply larger matrices, treat them as blocks of 4x4 matrices. However, this approach only works with matrix sizes which are a multiple of four in both dimensions. To use this method without changing it, pad the matrix with zeroes.
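A minimal sketch of such zero-padding, assuming the column-major storage used throughout this example, is shown below. The helper name and allocation strategy are illustrative only:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Copy an n x m column-major matrix into a buffer whose dimensions are
// rounded up to multiples of four, filling the extra rows and columns
// with zeroes. The caller frees the returned buffer.
float *pad_to_multiple_of_4(const float *M, uint32_t n, uint32_t m,
                            uint32_t *n_pad, uint32_t *m_pad) {
    *n_pad = (n + 3) & ~3u;
    *m_pad = (m + 3) & ~3u;

    float *P = calloc((size_t)(*n_pad) * (*m_pad), sizeof(float));
    if (P == NULL) {
        return NULL;
    }
    for (uint32_t j = 0; j < m; j++) {
        // Copy one column at a time; calloc already zeroed the padding.
        memcpy(&P[(size_t)j * (*n_pad)], &M[(size_t)j * n], n * sizeof(float));
    }
    return P;
}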

The Neon code for a more general matrix multiplication is listed below. The structure of the kernel is similar to the 4x4 kernel, with the addition of loops and address calculations being the major changes. As in the 4x4 kernel, unique variable names are used for the B columns; the alternative would be to use one variable and reload it. Using distinct names acts as a hint to the compiler to assign different registers to these variables, which enables the processor to complete the arithmetic instructions for one column while waiting on the loads for another.

void matrix_multiply_neon(const float32_t *A, const float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    /*
     * Multiply matrices A and B, store the result in C.
     * It is the users responsibility to make sure the matrices are compatible.
     */

    int a_idx;
    int b_idx;
    int c_idx;

    // these are the columns of a 4x4 sub matrix of A
    float32x4_t A0;
    float32x4_t A1;
    float32x4_t A2;
    float32x4_t A3;

    // these are the columns of a 4x4 sub matrix of B
    float32x4_t B0;
    float32x4_t B1;
    float32x4_t B2;
    float32x4_t B3;

    // these are the columns of a 4x4 sub matrix of C
    float32x4_t C0;
    float32x4_t C1;
    float32x4_t C2;
    float32x4_t C3;

    for (int i_idx=0; i_idx<n; i_idx+=4) {
        for (int j_idx=0; j_idx<m; j_idx+=4) {
            // zero accumulators before matrix op
            C0 = vmovq_n_f32(0);
            C1 = vmovq_n_f32(0);
            C2 = vmovq_n_f32(0);
            C3 = vmovq_n_f32(0);
            for (int k_idx=0; k_idx<k; k_idx+=4){
                // compute base index to 4x4 block
                a_idx = i_idx + n*k_idx;
                b_idx = k*j_idx + k_idx;

                // load most current a values in row
                A0 = vld1q_f32(A+a_idx);
                A1 = vld1q_f32(A+a_idx+n);
                A2 = vld1q_f32(A+a_idx+2*n);
                A3 = vld1q_f32(A+a_idx+3*n);

                // multiply accumulate 4x1 blocks, that is each column C
                B0 = vld1q_f32(B+b_idx);
                C0 = vfmaq_laneq_f32(C0,A0,B0,0);
                C0 = vfmaq_laneq_f32(C0,A1,B0,1);
                C0 = vfmaq_laneq_f32(C0,A2,B0,2);
                C0 = vfmaq_laneq_f32(C0,A3,B0,3);

                B1 = vld1q_f32(B+b_idx+k);
                C1 = vfmaq_laneq_f32(C1,A0,B1,0);
                C1 = vfmaq_laneq_f32(C1,A1,B1,1);
                C1 = vfmaq_laneq_f32(C1,A2,B1,2);
                C1 = vfmaq_laneq_f32(C1,A3,B1,3);

                B2 = vld1q_f32(B+b_idx+2*k);
                C2 = vfmaq_laneq_f32(C2,A0,B2,0);
                C2 = vfmaq_laneq_f32(C2,A1,B2,1);
                C2 = vfmaq_laneq_f32(C2,A2,B2,2);
                C2 = vfmaq_laneq_f32(C2,A3,B2,3);

                B3 = vld1q_f32(B+b_idx+3*k);
                C3 = vfmaq_laneq_f32(C3,A0,B3,0);
                C3 = vfmaq_laneq_f32(C3,A1,B3,1);
                C3 = vfmaq_laneq_f32(C3,A2,B3,2);
                C3 = vfmaq_laneq_f32(C3,A3,B3,3);
            }
            // compute base index for stores
            c_idx = n*j_idx + i_idx;
            vst1q_f32(C+c_idx, C0);
            vst1q_f32(C+c_idx+n,C1);
            vst1q_f32(C+c_idx+2*n,C2);
            vst1q_f32(C+c_idx+3*n,C3);
        }
    }
}

Compiling and disassembling this function, and comparing it with the C function shows:

  • Fewer arithmetic instructions for a given matrix multiplication, because it utilizes the Advanced SIMD technology with full register packing. Typical C code, generally, does not.

  • FMLA instead of FMUL instructions, as specified by the intrinsics.

  • Fewer loop iterations. When used properly, intrinsics allow loops to be unrolled easily.

However, there are some unnecessary loads and stores, caused by the memory allocation and initialization of data types (for example, float32x4_t) that are not used in the non-intrinsics C code.

Optimizing this code for SVE produces the following code:

void matrix_multiply_sve(const float32_t *A, const float32_t *B, float32_t *C, uint32_t n, uint32_t m, uint32_t k) {
    /*
     * Multiply matrices A and B, store the result in C.
     * It is the users responsibility to make sure the matrices are compatible.
     */

    int a_idx;
    int b_idx;
    int c_idx;

    // these are the columns of a nx4 sub matrix of A
    svfloat32_t A0;
    svfloat32_t A1;
    svfloat32_t A2;
    svfloat32_t A3;

    // these are the columns of a 4x4 sub matrix of B
    svfloat32_t B0;
    svfloat32_t B1;
    svfloat32_t B2;
    svfloat32_t B3;

    // these are the columns of a nx4 sub matrix of C
    svfloat32_t C0;
    svfloat32_t C1;
    svfloat32_t C2;
    svfloat32_t C3;

    for (int i_idx=0; i_idx<n; i_idx+=svcntw()) {
        // calculate predicate for this i_idx
        svbool_t pred = svwhilelt_b32_u32(i_idx, n);

        for (int j_idx=0; j_idx<m; j_idx+=4) {
            // zero accumulators before matrix op
            C0 = svdup_n_f32(0);
            C1 = svdup_n_f32(0);
            C2 = svdup_n_f32(0);
            C3 = svdup_n_f32(0);
            for (int k_idx=0; k_idx<k; k_idx+=4){
                // compute base index to 4x4 block
                a_idx = i_idx + n*k_idx;
                b_idx = k*j_idx + k_idx;

                // load most current a values in row
                A0 = svld1_f32(pred, A+a_idx);
                A1 = svld1_f32(pred, A+a_idx+n);
                A2 = svld1_f32(pred, A+a_idx+2*n);
                A3 = svld1_f32(pred, A+a_idx+3*n);

                // multiply accumulate 4x1 blocks, that is each column C
                B0 = svld1rq_f32(svptrue_b32(), B+b_idx);
                C0 = svmla_lane_f32(C0,A0,B0,0);
                C0 = svmla_lane_f32(C0,A1,B0,1);
                C0 = svmla_lane_f32(C0,A2,B0,2);
                C0 = svmla_lane_f32(C0,A3,B0,3);

                B1 = svld1rq_f32(svptrue_b32(), B+b_idx+k);
                C1 = svmla_lane_f32(C1,A0,B1,0);
                C1 = svmla_lane_f32(C1,A1,B1,1);
                C1 = svmla_lane_f32(C1,A2,B1,2);
                C1 = svmla_lane_f32(C1,A3,B1,3);

                B2 = svld1rq_f32(svptrue_b32(), B+b_idx+2*k);
                C2 = svmla_lane_f32(C2,A0,B2,0);
                C2 = svmla_lane_f32(C2,A1,B2,1);
                C2 = svmla_lane_f32(C2,A2,B2,2);
                C2 = svmla_lane_f32(C2,A3,B2,3);

                B3 = svld1rq_f32(svptrue_b32(), B+b_idx+3*k);
                C3 = svmla_lane_f32(C3,A0,B3,0);
                C3 = svmla_lane_f32(C3,A1,B3,1);
                C3 = svmla_lane_f32(C3,A2,B3,2);
                C3 = svmla_lane_f32(C3,A3,B3,3);
            }
            // compute base index for stores
            c_idx = n*j_idx + i_idx;
            svst1_f32(pred, C+c_idx, C0);
            svst1_f32(pred, C+c_idx+n,C1);
            svst1_f32(pred, C+c_idx+2*n,C2);
            svst1_f32(pred, C+c_idx+3*n,C3);
        }
    }
}

This code is almost identical to the earlier Neon code, except for the differing intrinsics. In addition, thanks to predication, there is no longer a constraint on the number of rows of A. However, you must ensure that the number of columns of A and C, and both dimensions of B, are multiples of four, because the predication used above does not account for this. Adding such further predication is possible, but would reduce the clarity of this example.

Comparing it with the C function and Neon functions, the SVE example:

  • Uses WHILELT to determine the predicate for each iteration of the outer loop. The loop condition guarantees that there is at least one element left to process.

  • Increments i_idx by CNTW (the number of 32-bit elements in a vector) to avoid hard-coding the number of elements done in an iteration of the outer loop.

Program conventions: Macros, Types, and Functions

Macros

In order to use the intrinsics, the Advanced SIMD or SVE architecture must be supported by the compiler. Some specific instructions might not be enabled. When the following macros are defined and equal to 1, the corresponding features are available:

Neon and SVE Macros

Extension

Supported macros

Neon

  • __aarch64__

    • Selection of architecture-dependent source at compile time.

    • Always 1 for AArch64.

  • __ARM_NEON

    • Advanced SIMD is supported by the compiler.

    • Always 1 for AArch64.

  • __ARM_NEON_FP

    • Neon floating-point operations are supported.

    • Always 1 for AArch64.

  • __ARM_FEATURE_CRYPTO

    • Crypto instructions are available.

    • Cryptographic Neon intrinsics are therefore available.

  • __ARM_FEATURE_FMA

    • The fused multiply-accumulate instructions are available.

    • Neon intrinsics which use these are therefore available.

SVE

  • __ARM_FEATURE_SVE

    • Always 1 if SVE is supported.

    • The SVE instructions are available.

    • SVE intrinsics which use these are therefore available.

This list is not exhaustive and further macros are detailed on the Arm C Language Extensions web page.
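A common pattern is to use these macros to select an implementation at compile time. The following hedged sketch assumes that process_sve, process_neon, and process_scalar are kernels you have written elsewhere; they are placeholders, not library functions:

#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#elif defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Kernels defined elsewhere (placeholders for this illustration).
void process_sve(float *data, uint32_t n);
void process_neon(float *data, uint32_t n);
void process_scalar(float *data, uint32_t n);

void process(float *data, uint32_t n) {
#if defined(__ARM_FEATURE_SVE)
    process_sve(data, n);      // SVE intrinsics path
#elif defined(__ARM_NEON)
    process_neon(data, n);     // Neon intrinsics path
#else
    process_scalar(data, n);   // portable fallback
#endif
}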

Types

Neon and SVE Types

Extension

Supported types

Neon

There are three major categories of Neon data type available in arm_neon.h which follow these patterns:

  • baseW_t

    Scalar data types. For example, int64_t.

  • baseWxL_t

    Vector data types. For example, int32x2_t.

  • baseWxLxN_t

    Vector array data types. For example, int16x4x2_t.

Where:

  • base refers to the fundamental data type.

  • W is the width of the fundamental type.

  • L is the number of scalar data type instances in a vector data type, for example an array of scalars.

  • N is the number of vector data type instances in a vector array type, for example a struct of arrays of scalars.

Generally, W and L are values where the vector data types are 64 bits or 128 bits long, and so fit completely into a Neon register. N corresponds with those instructions which operate on multiple registers at once.

SVE

No existing C type maps directly to the concept of an SVE vector or predicate. The ACLE therefore classifies SVE vectors and predicates as belonging to a new category of type called sizeless data types. Sizeless data types comprise vector types and predicate types, and their names are prefixed with sv, for example svint64_t.

  • baseW_t

    Scalar data types. SVE adds support for float16_t, float32_t, and float64_t.

  • svbaseW_t

    Sizeless vector data types holding a single vector. For example, svint64_t.

  • svbaseWxN_t

    Sizeless vector data types for two, three, and four vectors. For example, svint64x2_t.

  • svbool_t

    Sizeless single predicate data type which has enough bits to control an operation on a vector of bytes.

Where:

  • base refers to the fundamental data type.

  • bool refers to the bool type from stdbool.h.

  • W is the width of the fundamental type.

  • N is the number of vector data type instances in a vector array type, for example a struct of arrays of scalars.

Functions

For Neon, similar to the Arm C Language Extensions, the function prototypes from arm_neon.h follow a common pattern. At the most general level, this is:

ret v[p][q][r]name[u][n][q][x][_high][_lane | laneq][_n][_result]_type(args)

For example:

int8x16_t vmulq_s8 (int8x16_t a, int8x16_t b)

The mul in the function name is a hint that this intrinsic uses the MUL instruction. The types of the arguments and the return value (sixteen signed 8-bit integers) map to the following instruction:

MUL Vd.16B, Vn.16B, Vm.16B

This function multiplies corresponding elements of a and b and returns the result.

Some of the letters and names are overloaded, but in the order above:

ret

The return type of the function.

v

Short for vector and is present on all the intrinsics.

p

Indicates a pairwise operation. ([value] means value might be present).

q

Indicates a saturating operation (except for vqtb[l][x] in AArch64 operations, where the q indicates 128-bit index and result operands).

r

Indicates a rounding operation.

name

The descriptive name of the basic operation. Often, this is an Advanced SIMD instruction, but it does not have to be.

u

Indicates signed-to-unsigned saturation.

n

Indicates a narrowing operation.

q

Postfixing the name indicates an operation on 128-bit vectors.

x

Indicates an Advanced SIMD scalar operation in AArch64. It can be one of b, h, s, or d (that is, 8, 16, 32, or 64 bits).

_high

In AArch64, used for widening and narrowing operations involving 128-bit operands. For widening 128-bit operands, high refers to the top 64-bits of the source operand (or operands). For narrowing, it refers to the top 64-bits of the destination operand.

_n

Indicates a scalar operand that is supplied as an argument.

_lane

Indicates a scalar operand taken from the lane of a vector. _laneq indicates a scalar operand taken from the lane of an input vector of 128-bit width. ( left | right means only left or right would appear).

type

The primary operand type in short form.

args

The arguments of the function.
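A few real intrinsic names, broken down against this pattern, might make the convention more concrete. This is a hedged illustration; the variable names are arbitrary:

#include <arm_neon.h>

void naming_examples(void) {
    // v add q _u8    : vector, ADD, q after the name = 128-bit vectors, unsigned 8-bit lanes
    uint8x16_t r0 = vaddq_u8(vdupq_n_u8(1), vdupq_n_u8(2));

    // v q add _s16   : q before the name = saturating, ADD, signed 16-bit lanes, 64-bit vectors
    int16x4_t r1 = vqadd_s16(vdup_n_s16(3), vdup_n_s16(4));

    // v mul q _n _f32: vector, MUL, 128-bit vectors, _n = scalar second operand, 32-bit floats
    float32x4_t r2 = vmulq_n_f32(vdupq_n_f32(1.5f), 2.0f);

    (void)r0; (void)r1; (void)r2;   // silence unused-variable warnings
}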

For SVE, the function prototypes from arm_sve.h follow a common pattern. At the most general level, this is:

svbase[_disambiguator][_type0][_type1]...[_predication]

For example, svclz[_u16]_m says that the full name is svclz_u16_m and that its overloaded alias is svclz_m.

Where:

base

The lower-case name of an SVE instruction, with some adjustments.

_disambiguator

Distinguishes between different forms of a function.

_type0|_type1|...

List the types of vectors and predicates, starting with the return type and continuing with the argument types.

_predication

This suffix describes the inactive elements in the result of a predicated operation. It can be one of z (zero predication), m (merge predication), or x ('Do not care' predication).

For more information about the individual function parts, see Arm C Language Extensions for SVE specification.
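As a short illustration of the naming scheme, and of the predication suffixes, the following sketch calls the CLZ operation through its full name and through its overloaded aliases. The function itself is illustrative only:

#include <arm_sve.h>

svuint16_t clz_examples(svbool_t pg, svuint16_t a, svuint16_t b) {
    // Full name sv clz _u16 _m: merge predication, inactive lanes keep the value of b.
    svuint16_t x = svclz_u16_m(b, pg, a);

    // Overloaded alias with zero predication: inactive lanes become zero.
    svuint16_t y = svclz_z(pg, a);

    // 'Do not care' predication: inactive lanes have unspecified values.
    svuint16_t z = svclz_x(pg, a);

    return svadd_u16_x(pg, x, svadd_u16_x(pg, y, z));
}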

