Like SVE, SVE2 is based on scalable vectors. SVE and SVE2 add the following register banks: 32 scalable vector registers (Z0-Z31), 16 scalable predicate registers (P0-P15), one First Fault Register (FFR), and the scalable vector system control registers ZCR_ELx.
Scalable vector registers Z0-Z31
Each vector register can be 128 to 2048 bits long, in 128-bit increments. The bottom 128 bits are shared with the fixed 128-bit V0-V31 vector registers of Neon. The scalable vectors can:
- Hold 64, 32, 16, and 8-bit elements.
- Support integer, double-precision, single-precision, and half-precision floating-point elements.
- Be configured with the vector length in each Exception Level (EL).
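The sizes listed above can be illustrated with a small arithmetic sketch. It only models the numbers given in the text (lengths from 128 to 2048 bits in 128-bit steps, and 8, 16, 32, or 64-bit elements); the function names are hypothetical and are not part of any Arm tool or API.

```python
def legal_vector_lengths():
    """Vector lengths in bits as described in the text: 128 to 2048 in steps of 128."""
    return list(range(128, 2048 + 1, 128))


def lanes_per_vector(vl_bits, element_bits):
    """Number of elements of the given width that fit in one Z register."""
    if vl_bits not in legal_vector_lengths():
        raise ValueError("vector length must be a multiple of 128 between 128 and 2048")
    if element_bits not in (8, 16, 32, 64):
        raise ValueError("element size must be 8, 16, 32, or 64 bits")
    return vl_bits // element_bits


# Example: a 512-bit implementation holds sixteen 32-bit elements per register.
lanes_512_s = lanes_per_vector(512, 32)
```

Because the element count depends on the vector length, code written for SVE normally avoids hard-coding the number of lanes.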
Scalable predicate registers P0-P15
The predicate registers are mainly used as bit masks for data operations, where:
- Each predicate register is 1/8 of the length of a scalable vector register, that is, one predicate bit for each byte of the vector.
- P0-P7 are governing predicates for load, store and arithmetic.
- P8-P15 are additional predicates for loop management.
- The First Fault Register (FFR) is used for speculative memory accesses.
If the predicate registers are not used as bit masks, they are used as operands.
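The relationship between predicate size and vector size can be sketched as follows. This is a minimal model, assuming one predicate bit per vector byte and assuming that, for elements wider than one byte, the lowest bit of each element's group of predicate bits is the one that controls the element. The helper names are hypothetical.

```python
def predicate_bits(vl_bits):
    """A predicate register has one bit for each byte of the vector register."""
    return vl_bits // 8


def controlling_bit_positions(vl_bits, element_bits):
    """Predicate bit positions that control each element of the given width.

    Each element owns element_bits // 8 predicate bits; this model treats
    the lowest bit of that group as the controlling bit.
    """
    step = element_bits // 8
    return list(range(0, predicate_bits(vl_bits), step))


# Example: a 128-bit vector of 32-bit elements uses predicate bits 0, 4, 8, 12.
positions = controlling_bit_positions(128, 32)
```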
Scalable vector system control registers
The scalable vector system control registers indicate the SVE implementation features.
- The ZCR_ELx.LEN field sets the vector length for the current and lower Exception levels.
- Most of the other bits are currently reserved for future use.
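As a rough sketch of how the LEN field maps to a vector length, the model below assumes that a 4-bit LEN value requests a length of (LEN + 1) x 128 bits, and that the length seen at a given Exception level is limited by the values programmed at that level and at every higher level. It ignores which lengths a particular implementation actually supports, and the function names are hypothetical.

```python
def requested_vl_bits(len_field):
    """Vector length requested by a 4-bit LEN value: (LEN + 1) * 128 bits."""
    if not 0 <= len_field <= 15:
        raise ValueError("LEN is modelled here as a 4-bit field (0 to 15)")
    return (len_field + 1) * 128


def constrained_vl_bits(len_fields_current_and_higher):
    """Simplified model: the smallest length requested by the current
    Exception level and by each higher Exception level."""
    return min(requested_vl_bits(f) for f in len_fields_current_and_higher)


# Example: EL1 asks for 512 bits (LEN=3), EL2 limits to 256 bits (LEN=1),
# EL3 allows the maximum (LEN=15). The model yields 256 bits at EL1.
vl_at_el1 = constrained_vl_bits([3, 1, 15])
```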
SVE2 assembly syntax
SVE2 uses the same assembly syntax format as SVE, as the instruction examples later in this section show.
For more information and examples see the Arm Architecture Reference Manual Supplement – The Scalable Vector Extension (SVE) for Armv8-A.
Key SVE architecture features that SVE2 inherits:
The flexible addressing modes in SVE allow a vector base address or a vector offset, which enables loading a single vector register from non-contiguous memory locations (a gather load). For example:
LD1SB Z0.S, P0/Z, [Z1.S, #4] // Gather load of signed bytes to active 32-bit elements of Z0 from memory addresses generated by 32-bit vector base Z1 plus immediate index #4.
LD1SB Z0.D, P0/Z, [X0, Z1.D] // Gather load of signed bytes to active elements of Z0 from memory addresses generated by a 64-bit scalar base X0 plus vector index in Z1.D.
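The scalar-base-plus-vector-index form can be modelled in a few lines of Python. This is only an illustrative sketch of the behaviour described in the comments above: memory is a `bytes` object, the predicate is a list of booleans, and the function name is hypothetical.

```python
def gather_load_signed_bytes(memory, addresses, predicate):
    """Model of a zeroing gather load of signed bytes.

    Active lanes read one byte from their own address and sign-extend it;
    inactive lanes are set to zero and do not access memory.
    """
    result = []
    for addr, active in zip(addresses, predicate):
        if active:
            byte = memory[addr]
            result.append(byte - 256 if byte >= 128 else byte)
        else:
            result.append(0)
    return result


memory = bytes([0x01, 0xFF, 0x7F, 0x80, 0x10, 0x20])
base = 1                      # plays the role of the scalar base X0
offsets = [4, 0, 2, 1]        # plays the role of the vector index Z1.D
addresses = [base + off for off in offsets]
predicate = [True, True, False, True]

loaded = gather_load_signed_bytes(memory, addresses, predicate)
# loaded is [32, -1, 0, 127]
```

Each lane computes its own address, so the loaded bytes do not need to be adjacent in memory.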
Operate on individual lanes of a vector, controlled by a governing predicate register. For example:
ADD Z0.D, P0/M, Z0.D, Z1.D // Add the active elements of Z0 and Z1 and put the result in Z0. P0 indicates which elements of the operands are active and which are inactive. The ‘M’ after P0 indicates merging predication: inactive elements of Z0 keep the values they had before the ADD operation. A ‘Z’ after P0 would instead indicate zeroing predication: inactive elements are set to zero in the destination vector register.
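The difference between merging and zeroing predication can be shown with a scalar model. The sketch below is not SVE code; it treats vectors as Python lists of 64-bit lanes, and the function name and `mode` parameter are hypothetical.

```python
MASK64 = (1 << 64) - 1


def predicated_add(old_dest, src1, src2, predicate, mode):
    """Element-wise add under a governing predicate.

    mode "M" (merging): inactive lanes keep the old destination value.
    mode "Z" (zeroing): inactive lanes are set to zero.
    """
    if mode not in ("M", "Z"):
        raise ValueError("mode must be 'M' or 'Z'")
    out = []
    for d, a, b, active in zip(old_dest, src1, src2, predicate):
        if active:
            out.append((a + b) & MASK64)   # wrap like a 64-bit lane
        elif mode == "M":
            out.append(d)
        else:
            out.append(0)
    return out


old = [9, 9, 9, 9]
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
p = [True, False, True, False]

merged = predicated_add(old, a, b, p, "M")   # [11, 9, 33, 9]
zeroed = predicated_add(old, a, b, p, "Z")   # [11, 0, 33, 0]
```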
Eliminate loop heads and tails and other overhead by processing partial vectors. The indexes of the active and inactive elements are recorded in the predicate registers, so that in the next loop iteration only the active elements perform the expected operations. For example:
WHILELO P0.S, X8, X9 // Generate a predicate in P0 that, starting from the lowest numbered element, is true while the incrementing value of the first unsigned scalar operand X8 is lower than the second scalar operand X9, and false thereafter, up to the highest numbered element.
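The way a loop-control predicate removes the need for a separate tail loop can be sketched with a scalar model. The code below assumes non-negative Python integers in place of unsigned 64-bit registers and ignores wraparound; `whilelo` and `predicated_sum_in_chunks` are hypothetical names used only for illustration.

```python
def whilelo(start, limit, lanes):
    """Model of the predicate described above: lane i is true while
    start + i is lower than limit, and false thereafter."""
    return [start + i < limit for i in range(lanes)]


def predicated_sum_in_chunks(data, lanes):
    """Sum data one vector at a time, with no scalar tail loop.

    The final iteration processes a partial vector: lanes past the end of
    the data are inactive and are skipped.
    """
    total = 0
    iterations = 0
    i = 0
    n = len(data)
    while i < n:
        pred = whilelo(i, n, lanes)
        for lane, active in enumerate(pred):
            if active:
                total += data[i + lane]
        i += lanes
        iterations += 1
    return total, iterations


# Ten elements with four lanes per vector: two full vectors, then a
# partial vector with only two active lanes.
last_pred = whilelo(8, 10, 4)                 # [True, True, False, False]
total, iterations = predicated_sum_in_chunks(list(range(10)), 4)
```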
SVE relaxes the restrictions that Neon places on vectorizing speculative loads. SVE introduces first-fault vector load instructions, such as LDFF1D, and the First Fault Register (FFR) to allow vector accesses to cross into invalid pages. For example:
LDFF1D Z0.D, P0/Z, [Z1.D, #0] // Gather load with first-faulting behaviour of doublewords to active elements of Z0 from memory addresses generated by the vector base Z1 plus 0. Inactive elements do not read Device memory or signal faults, and are set to zero in the destination vector. A successful load from valid memory leaves the corresponding element of the First Fault Register (FFR) true, and the first faulting load sets the corresponding element and all following elements of the FFR to false.
RDFFR P0.B // Read the first-fault register (FFR) and place in the destination predicate without predication.
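The first-fault behaviour can be approximated with the scalar model below. It is a simplified sketch under stated assumptions: valid memory is a dictionary from address to value, any address missing from the dictionary is treated as faulting, a fault on the first active element is reported as a Python exception, and destination lanes from the faulting element onward are simply left as zero. The function name is hypothetical.

```python
def first_fault_gather(memory, addresses, predicate, ffr):
    """Model of a first-faulting gather load.

    Returns (loaded values, updated FFR). A fault on the first active lane
    is raised; a fault on any later active lane is suppressed and clears
    the FFR from that lane onward.
    """
    result = [0] * len(addresses)
    new_ffr = list(ffr)
    seen_active = False
    for lane, (addr, active) in enumerate(zip(addresses, predicate)):
        if not active:
            continue                       # inactive lanes never access memory
        if addr not in memory:
            if not seen_active:
                raise MemoryError("fault on the first active element")
            for later in range(lane, len(new_ffr)):
                new_ffr[later] = False     # record where loading stopped
            break
        result[lane] = memory[addr]
        seen_active = True
    return result, new_ffr


valid_memory = {0: 10, 8: 20}              # addresses 16 and 24 are invalid
values, ffr = first_fault_gather(
    valid_memory, [0, 8, 16, 24], [True] * 4, [True] * 4
)
# values is [10, 20, 0, 0] and ffr is [True, True, False, False]
```

After such a load, software can read the FFR to find out how many elements were loaded and continue from the first element that was not.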
In-order or tree-based floating-point sums let you trade off repeatability against performance. For example:
FADDP Z0.S, P0/M, Z0.S, Z1.S // Add pairs of adjacent floating-point elements within each source vector, Z0 and Z1, and interleave the results from corresponding lanes. The interleaved result values are destructively placed in the first source vector Z0.
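The repeatability trade-off comes from the fact that floating-point addition is not associative, so the order of the additions can change the result. The sketch below compares a strictly in-order sum with a pairwise (tree-based) sum using Python double-precision floats; both function names are hypothetical and the code does not model any specific SVE instruction.

```python
def in_order_sum(values):
    """Add the elements strictly from first to last."""
    total = 0.0
    for v in values:
        total += v
    return total


def tree_sum(values):
    """Add adjacent pairs repeatedly until a single value remains."""
    level = list(values)
    if not level:
        return 0.0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)              # pad odd-length levels
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


values = [1e16, 1.0, -1e16, 1.0]
sequential = in_order_sum(values)          # 1.0
pairwise = tree_sum(values)                # 0.0
```

The in-order sum gives the same answer regardless of vector length, while a tree-based sum exposes more parallelism but its grouping of additions, and therefore its rounding, can differ.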