SVE architecture fundamentals

This section introduces the basic architecture features of SVE.

SVE is based on a set of scalable vectors. SVE adds the following registers:

  • 32 scalable vector registers, Z0-Z31
  • 16 scalable predicate registers, P0-P15
  • One First Fault predicate Register (FFR)
  • Scalable vector system control registers, ZCR_ELx

Let us look at each of these registers in turn.

Scalable vector registers Z0-Z31

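The scalable vector registers Z0-Z31 can be implemented with any length from 128 to 2048 bits that is a multiple of 128 bits. The microarchitecture chooses the length. The bottom 128 bits of each register are shared with the corresponding fixed 128-bit Neon vector register, V0-V31.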

The figure below shows the scalable vector registers Z0-Z31:

Z0-Z31 registers

The scalable vectors:

  • Can hold 64, 32, 16, and 8-bit elements
  • Support integer, double-precision, single-precision, and half-precision floating-point elements
  • Have a vector length that can be configured for each Exception level (EL)
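
As a rough illustration of how the element count scales with the vector length, the following Python sketch divides a vector length by each element size listed above. It is plain arithmetic, not SVE code, and `lanes_per_vector` is a hypothetical helper introduced only for this example. It assumes the vector length is a multiple of 128 bits between 128 and 2048.

```python
# Illustrative arithmetic only. lanes_per_vector is a hypothetical helper,
# not part of any SVE toolchain.
ELEMENT_BITS = (8, 16, 32, 64)


def lanes_per_vector(vector_bits: int, element_bits: int) -> int:
    """Return how many elements of element_bits fit in one Z register."""
    if vector_bits % 128 != 0 or not 128 <= vector_bits <= 2048:
        raise ValueError("vector length must be a multiple of 128 in 128..2048")
    if element_bits not in ELEMENT_BITS:
        raise ValueError("element size must be 8, 16, 32, or 64 bits")
    return vector_bits // element_bits


if __name__ == "__main__":
    for vl in (128, 256, 512, 2048):
        print(vl, {bits: lanes_per_vector(vl, bits) for bits in ELEMENT_BITS})
```

The same source code therefore processes a different number of elements per instruction on different implementations, which is why SVE code is normally written without assuming a fixed lane count.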

Scalable predicate registers P0-P15

Many SVE instructions use a predicate register as a mask that selects which elements are active in the operation, which makes vector operations more flexible. The figure below shows the scalable predicate registers P0-P15:

Scalable predicate registers P0-P15

The predicate registers are usually used as bit masks for data operations, where:

  • Each predicate register is 1/8 of the length of a Zx register, with one predicate bit for each byte of the vector.
  • P0-P7 are governing predicates for load, store, and arithmetic instructions.
  • P8-P15 are extra predicates for loop management.
  • The First Fault Register (FFR) is a special predicate register. First-fault load instructions set it to indicate, for each element, whether the load succeeded. FFR is designed to support speculative memory accesses, which make vectorization easier and safer in many situations.

The predicate registers can also be used as operands in various SVE instructions.
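
The 1/8 ratio follows from having one predicate bit per vector byte. The following Python sketch shows that arithmetic. It also shows which predicate bit governs each element, under the architectural rule that an element owns one predicate bit per byte and only the lowest of those bits decides whether the element is active. The function names are hypothetical and exist only for this illustration.

```python
# Illustrative model of predicate layout. Function names are hypothetical.

def predicate_bits(vector_bits: int) -> int:
    """One predicate bit per vector byte, so a P register holds VL/8 bits."""
    return vector_bits // 8


def governing_bit_positions(vector_bits: int, element_bits: int) -> list[int]:
    """Return the predicate bit index that controls each element.

    Each element owns element_bits/8 predicate bits, and the lowest-numbered
    one decides whether the element is active.
    """
    step = element_bits // 8
    return list(range(0, predicate_bits(vector_bits), step))


if __name__ == "__main__":
    print(predicate_bits(256))
    print(governing_bit_positions(128, 32))
```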

Configurable vector length

Within the maximum implemented vector length, the vector length can also be configured for each Exception level through the ZCR_ELx registers. The implemented and configured lengths must meet the requirements of the AArch64 SVE Supplement:

  • An implementation must allow the vector length to be constrained to any power of two.
  • An implementation is permitted to allow the vector length to be constrained to multiples of 128 that are not a power of two.

Privileged Exception levels can use the LEN fields of the scalable vector control registers ZCR_EL1, ZCR_EL2, and ZCR_EL3 to constrain the vector length at that Exception level and at less privileged Exception levels:

Scalable vector control registers ZCR_ELx

The scalable vector system control registers configure SVE at each Exception level:

  • The ZCR_ELx.LEN field sets the vector length for the current and lower Exception levels.
  • Most bits are currently reserved for future use.
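
The way the LEN fields combine can be sketched in Python as follows. The sketch relies on the LEN encoding, in which a 4-bit value requests a length of (LEN + 1) × 128 bits. For simplicity it assumes that the implementation supports every multiple of 128 bits up to its maximum. An implementation that does not would instead select a supported length no greater than the requested one. The function names are hypothetical.

```python
# Illustrative model of vector length constraint. Assumes every multiple of
# 128 bits up to the implemented maximum is supported (a simplification).

def requested_bits(len_field: int) -> int:
    """Decode a ZCR_ELx.LEN value into the vector length it requests."""
    if not 0 <= len_field <= 15:
        raise ValueError("LEN is a 4-bit field")
    return (len_field + 1) * 128


def effective_vector_bits(max_implemented_bits: int, len_fields: list[int]) -> int:
    """Return the vector length seen at the current Exception level.

    len_fields holds the LEN values of every ZCR_ELx register that applies:
    the one for the current level and those of more privileged levels.
    """
    limit = min(requested_bits(field) for field in len_fields)
    return min(max_implemented_bits, limit)


if __name__ == "__main__":
    # A 512-bit implementation where a more privileged level requests 256 bits.
    print(effective_vector_bits(512, [15, 1]))
```

The smallest applicable request wins, so a less privileged Exception level can never see a longer vector than a more privileged level allows.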

SVE assembly syntax

An SVE assembly instruction is composed of an operation code, a destination register, a predicate register (if the instruction supports predication), and input operands. The following instruction examples show this format in detail.

Example 1:

LDFF1D {<Zt>.D}, <Pg>/Z, [<Xn|SP>, <Zm>.D, LSL #3]

Where:

  • <Zt> is the destination vector register, one of Z0-Z31
  • <Zt>.D and <Zm>.D specify the element type of the destination and operand vectors. The number of elements is not specified
  • <Pg> is the governing predicate register (P0-P7 for load instructions such as this one)
  • <Pg>/Z selects zeroing predication
  • <Zm> is the vector of offsets used by the gather-load addressing mode

Example 2:

ADD <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>

Where:

  • <Pg>/M selects merging predication
  • <Zdn> is both the destination register and one of the input operands. The syntax shows <Zdn> in both places for readability. In the instruction encoding, it is encoded only once.

Example 3:

ORRS <Pd>.B, <Pg>/Z, <Pn>.B, <Pm>.B

Where:

  • The S suffix means that the instruction sets the NZCV condition flags, which SVE reinterprets as predicate conditions
  • <Pg> is a governing predicate that acts as a “bit mask” in the example operation

SVE architecture features

SVE includes the following key architecture features:

  • Per-lane predication

    To allow flexible operations on selected elements, SVE introduces 16 predicate registers, P0-P15, which indicate the active lanes of the vectors that an operation applies to. For example:

    ADD Z0.D, P0/M, Z0.D, Z1.D  // Add the active elements of Z0 and Z1 and put the result in Z0.

    P0 indicates which elements of the operands are active and which are inactive. The ‘M’ after P0 selects merging predication: the inactive elements of Z0 keep their original values after the ADD operation. A ‘Z’ after P0 selects zeroing predication: the inactive elements of the destination register are set to zero after the operation.
    Per lane predication merging
    If the predicate specification is ‘/Z’, the operation sets to zero the elements of the destination vector whose predicate elements are zero. For example:
    CPY Z0.B, P0/Z, #0xFF  // Copy the signed integer 0xFF into the active elements of Z0.B. The inactive elements of Z0.B are set to zero.
    Per lane predication zeroing

    Not all instructions have a predicate option, and not all predicated instructions have both merging and zeroing forms. Refer to the SVE Architecture Supplement for the specification details of each instruction.
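
The difference between merging and zeroing can be modeled lane by lane. The following Python sketch is a scalar model of the two predication forms, not SVE code. `predicated_add` is a hypothetical function that does not correspond to one specific instruction, and it ignores the wrap-around of fixed-width integer elements.

```python
# Scalar model of per-lane predication. predicated_add is hypothetical and
# uses unbounded Python integers instead of fixed-width elements.

def predicated_add(dest, a, b, pred, mode):
    """Add a and b in active lanes; handle inactive lanes according to mode.

    mode "M" (merging) keeps the old destination value in inactive lanes.
    mode "Z" (zeroing) writes zero to inactive lanes.
    """
    if mode not in ("M", "Z"):
        raise ValueError("mode must be 'M' or 'Z'")
    result = []
    for old, x, y, active in zip(dest, a, b, pred):
        if active:
            result.append(x + y)
        else:
            result.append(old if mode == "M" else 0)
    return result


if __name__ == "__main__":
    dest = [9, 9, 9, 9]
    a = [1, 2, 3, 4]
    b = [10, 20, 30, 40]
    pred = [True, False, True, False]
    print(predicated_add(dest, a, b, pred, "M"))
    print(predicated_add(dest, a, b, pred, "Z"))
```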

  • Gather-load and scatter-store

    SVE addressing modes allow a vector to be used as the base address or as the offset in gather-load and scatter-store instructions, which enables access to non-contiguous memory locations. For example:
    LD1SB  Z0.S, P0/Z, [Z1.S]   // Gather load of signed bytes to active 32-bit elements of Z0 from memory addresses generated by 32-bit vector base Z1.
    LD1SB  Z0.D, P0/Z, [X0, Z1.D]  // Gather load of signed bytes to active elements of Z0 from memory addresses generated by a 64-bit scalar base X0 plus vector index in Z1.D.

    The following example shows the operation of LD1SB Z0.S, P0/Z, [Z1.S], where P0 is the governing predicate and Z1 contains scattered addresses. After loading, each active 32-bit element of Z0 holds the sign-extended byte fetched from the corresponding memory location.

    Gather-load and scatter-store example
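
The gather-load behavior described above can be approximated with a scalar Python model. This is a sketch under simplifying assumptions, not SVE code: memory is a `bytes` object indexed from address 0, and `gather_load_signed_bytes` is a hypothetical name. It mirrors LD1SB with zeroing predication by sign-extending each loaded byte and writing zero to inactive lanes.

```python
# Scalar model of a zeroing gather load of signed bytes. The memory layout
# and function names are assumptions made for illustration.

def sign_extend_byte(value: int) -> int:
    """Interpret an unsigned byte (0..255) as a signed 8-bit integer."""
    return value - 256 if value & 0x80 else value


def gather_load_signed_bytes(memory: bytes, addresses, pred):
    """Load one byte per active lane from its own address.

    Inactive lanes do not access memory and produce zero.
    """
    return [
        sign_extend_byte(memory[addr]) if active else 0
        for addr, active in zip(addresses, pred)
    ]


if __name__ == "__main__":
    memory = bytes([0x01, 0xFF, 0x7F, 0x80, 0x10])
    print(gather_load_signed_bytes(memory, [4, 1, 3, 0], [True, True, False, True]))
```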

     

  • Predicate-driven loop control and management

    Predication is a key feature of SVE. It not only gives flexible control over individual elements of a vector operation, but also enables predicate-driven loop control, which makes loop control efficient and flexible. By recording the indexes of the active and inactive elements in the predicate registers, this feature removes the overhead of handling partial vectors in separate loop heads and tails. In each loop iteration, only the active elements perform the expected operations. For example:

    WHILELO P0.S, X8, X9  // Generate a predicate in P0 that, starting from the lowest numbered element, is true while the incrementing value of the first unsigned scalar operand, X8, is lower than the second scalar operand, X9, and false thereafter, up to the highest numbered element.
    B.FIRST Loop_start  // B.FIRST (equivalent to B.MI) or B.NFRST (equivalent to B.PL) is often used to branch on the flags set by the WHILELO instruction, which test whether the first element of P0 is true or false, as the condition for continuing or ending a loop.
    Predicate-driven loop control and management example
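
To make the loop structure concrete, the following Python sketch models a WHILELO-style predicate and a loop that keeps iterating while the first predicate element is true, as a B.FIRST branch would. It is a scalar model with hypothetical names, and it ignores the wrap-around of 64-bit unsigned counters. Note that the final, partial vector is handled by the same loop body, with no separate tail code.

```python
# Scalar model of predicate-driven loop control. Names are hypothetical.

def whilelo(start: int, limit: int, lanes: int) -> list[bool]:
    """Return a predicate that is true for each lane i where start + i < limit."""
    return [start + i < limit for i in range(lanes)]


def scale_in_place(data: list, factor, lanes: int) -> list:
    """Multiply every item of data by factor, one vector of lanes at a time."""
    index = 0
    pred = whilelo(index, len(data), lanes)
    while pred[0]:  # models B.FIRST: continue while the first lane is active
        for lane, active in enumerate(pred):
            if active:  # inactive lanes touch neither memory nor results
                data[index + lane] *= factor
        index += lanes
        pred = whilelo(index, len(data), lanes)
    return data


if __name__ == "__main__":
    print(whilelo(8, 10, 4))
    print(scale_in_place([1, 2, 3, 4, 5, 6], 2, 4))
```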
  • Vector partitioning for software-managed speculation

    Speculative loads are a challenge for traditional vector memory reads: if a fault occurs in some elements during the read, it is difficult to reverse the load operation and to track which elements failed to load. Neon does not allow speculative loads. To allow speculative loads to vectors, SVE introduces the first-fault vector load instructions, for example LDFF1D. To allow vector accesses to cross into invalid pages, SVE also introduces the First Fault Register (FFR).

    When a first-fault vector load instruction loads to an SVE vector, FFR is updated with the success or failure of the load for each element. When a load fault occurs on an element other than the first active element, FFR records that element and all the elements after it as 0 (false), and no exception is triggered. Commonly, an RDFFR instruction is used to read the FFR status: if the first element is false, the loop finishes its iterations, and if the first element is true, the loop continues. FFR is the same length as a predicate register, and can be initialized with the SETFFR instruction. The following example uses LDFF1D to read from memory, and FFR is updated correspondingly:

    LDFF1D Z0.D, P0/Z, [Z1.D, #0]  // Gather load with first-faulting behaviour of doublewords to active elements of Z0 from memory addresses generated by the vector base Z1 plus 0.

    Inactive elements do not read Device memory or signal faults, and are set to zero in the destination vector. Each successful load from valid memory leaves the corresponding FFR element true. The first faulting load sets the corresponding FFR element, and all the elements after it, to false (0).

     Vector partitioning for software-managed speculation example
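
The first-fault behavior can be sketched as a scalar Python model. The model makes several assumptions for illustration: memory is a dictionary from valid addresses to values, an address missing from the dictionary stands for a faulting access, and elements at or after the fault are returned as zero, whereas real hardware leaves their values unspecified. `first_fault_gather` is a hypothetical name.

```python
# Scalar model of a first-fault gather load. The memory model and names are
# assumptions for illustration only.

def first_fault_gather(memory: dict, addresses, pred, ffr):
    """Return (loaded_values, new_ffr) for a first-fault gather load.

    A fault on the first active lane is reported normally (MemoryError here).
    A fault on a later lane is suppressed: that lane and all higher lanes are
    cleared in the returned FFR, and loading stops.
    """
    result = [0] * len(addresses)
    new_ffr = list(ffr)
    first_active_done = False
    for lane, (addr, active) in enumerate(zip(addresses, pred)):
        if not active:
            continue  # inactive lanes never access memory
        if addr in memory:
            result[lane] = memory[addr]
            first_active_done = True
            continue
        if not first_active_done:
            raise MemoryError(f"fault on first active element at {addr:#x}")
        for later in range(lane, len(new_ffr)):
            new_ffr[later] = False
        break
    return result, new_ffr


if __name__ == "__main__":
    memory = {0x1000: 11, 0x1008: 22}
    addresses = [0x1000, 0x1008, 0x2000, 0x1000]
    print(first_fault_gather(memory, addresses, [True] * 4, [True] * 4))
```

Software would then treat only the lanes that are still true in FFR as valid, process them, and retry from the first lane that was cleared.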
  • Extended floating-point and horizontal reductions

    To allow efficient reduction operations across a vector, and to meet different accuracy requirements, SVE enhances floating-point and horizontal reduction operations. The instructions use either in-order (low to high) or tree-based (pairwise) floating-point reduction ordering, and the two orderings can produce different rounding results. These operations trade off repeatability against performance. For example:

    FADDA  D0, P0, D0, Z2.D  // Floating-point add strictly-ordered reduction from low to high elements of the vector source, accumulating the result in a SIMD&FP scalar register.

    The example instruction adds D0 and all active elements of Z2.D, and places the result in the scalar register D0. Vector elements are processed strictly in order from low to high, with the scalar source D0 providing the initial value. Inactive elements in the source vector are ignored. In contrast, FADDV performs a recursive pairwise reduction and puts the result in a scalar register.

    Extended floating-point and horizontal reductions example
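
The effect of reduction ordering on rounding can be demonstrated with ordinary double-precision arithmetic. The following Python sketch is a scalar model with hypothetical function names: `ordered_sum` accumulates from low to high in the style of FADDA, and `pairwise_sum` combines neighbors in a tree in the style of FADDV, treating inactive elements as 0.0. The exact tree shape used by hardware is not modeled.

```python
# Scalar model of ordered versus pairwise floating-point reduction.
# Python floats are IEEE 754 double precision, like .D elements.

def ordered_sum(initial: float, values, pred) -> float:
    """Accumulate active elements strictly from low to high (FADDA style)."""
    acc = initial
    for value, active in zip(values, pred):
        if active:
            acc = acc + value
    return acc


def pairwise_sum(values, pred) -> float:
    """Reduce by repeatedly adding neighboring pairs (FADDV style)."""
    lanes = [value if active else 0.0 for value, active in zip(values, pred)]
    if not lanes:
        return 0.0
    while len(lanes) > 1:
        if len(lanes) % 2:
            lanes.append(0.0)  # pad odd counts so every lane has a partner
        lanes = [lanes[i] + lanes[i + 1] for i in range(0, len(lanes), 2)]
    return lanes[0]


if __name__ == "__main__":
    values = [1e16, 1.0, -1e16, 1.0]
    pred = [True] * 4
    print(ordered_sum(0.0, values, pred))
    print(pairwise_sum(values, pred))
```

With these inputs the two orderings return different results, because each small addend is rounded away at a different step. This is why the ordered form is useful when a result must match a sequential scalar loop.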
