## SVE2 architecture fundamentals

This section introduces the basic architecture features that SVE and SVE2 share.

Like SVE, SVE2 is based on scalable vectors. In addition to the existing register banks that Neon provides, SVE and SVE2 add the following registers:

• 32 scalable vector registers, `Z0-Z31`
• 16 scalable predicate registers, `P0-P15`
• One First Fault Register (FFR)
• Scalable vector system control registers, `ZCR_ELx`

Let’s look at each of these in turn.

### Scalable vector registers Z0-Z31

Each of the scalable vector registers, `Z0-Z31`, can be 128 to 2048 bits long, in 128-bit increments. The bottom 128 bits are shared with the fixed 128-bit `V0-V31` vector registers of Neon.

The figure below shows the scalable vector registers `Z0-Z31`:

Scalable vector registers `Z0-Z31`

The scalable vectors can:

• Hold 64, 32, 16, and 8-bit elements
• Support integer, double-precision, single-precision, and half-precision floating-point elements
• Be configured with the vector length in each Exception level (EL)
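
The relationship between vector length and element count can be sketched with a small Python model. This is an illustration only, not Arm code: the helper names `valid_vector_lengths` and `lanes` are hypothetical, and the element-size suffixes `B`, `H`, `S`, and `D` are assumed to map to 8, 16, 32, and 64 bits.

```python
# Illustrative model only: element counts for the vector lengths described above.
# The function names are hypothetical helpers, not part of any Arm tool or API.
ELEMENT_BITS = {"B": 8, "H": 16, "S": 32, "D": 64}

def valid_vector_lengths():
    """Vector lengths in bits: 128 to 2048, in 128-bit increments."""
    return list(range(128, 2048 + 1, 128))

def lanes(vl_bits, suffix):
    """Number of elements of the given size that fit in a vl_bits-wide vector."""
    if vl_bits not in valid_vector_lengths():
        raise ValueError("vector length must be a multiple of 128 in [128, 2048]")
    return vl_bits // ELEMENT_BITS[suffix]

if __name__ == "__main__":
    for vl in (128, 256, 512, 2048):
        print(vl, {s: lanes(vl, s) for s in ELEMENT_BITS})
```

Because the element count depends on a vector length that is only known at run time, SVE code is normally written without assuming a fixed number of lanes.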

### Scalable predicate registers P0-P15

The figure below shows the scalable predicate registers `P0-P15`:

Scalable predicate registers `P0-P15`

The predicate registers are usually used as bit masks for data operations, where:

• Each predicate register is 1/8 of the `Zx` length.
• `P0-P7` are governing predicates for load, store, and arithmetic.
• `P8-P15` are extra predicates for loop management.
• The First Fault Register (FFR) is used for speculative memory accesses.

If the predicate registers are not used as bit masks, they are used as operands.
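
The 1/8 ratio means that a predicate holds one bit for each byte of the vector. The sketch below models this in Python. It assumes the architectural convention that, for elements wider than a byte, the lowest-numbered bit of each group governs the element. The helper names are hypothetical.

```python
# Illustrative model only: how predicate bits line up with vector elements.
# Assumption: one predicate bit per vector byte, and the lowest bit of each
# group controls the element. Helper names are hypothetical.

def predicate_bits(vl_bits):
    """A predicate register is 1/8 of the vector length."""
    return vl_bits // 8

def governing_bit_positions(vl_bits, elem_bits):
    """Index of the predicate bit that controls each element."""
    step = elem_bits // 8  # bytes per element
    return list(range(0, predicate_bits(vl_bits), step))

if __name__ == "__main__":
    print(predicate_bits(256))
    print(governing_bit_positions(128, 32))
```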

### Scalable vector system control registers ZCR_ELx

The figure below shows the scalable vector system control registers `ZCR_ELx`:

Scalable vector system control registers `ZCR_ELx`

The scalable vector system control registers control the SVE implementation features:

• The `ZCR_ELx.LEN` field sets the vector length for the current and lower Exception levels.
• Most bits are currently reserved for future use.
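
As a rough sketch of how the `LEN` field maps to a vector length, the Python model below assumes that `LEN` is a 4-bit field requesting `(LEN + 1) * 128` bits, and that the effective length is the largest implemented length that does not exceed the request of the current or any higher Exception level. The function names and the `supported` list are hypothetical. Consult the Arm Architecture Reference Manual for the exact rules.

```python
# Illustrative model only, under the assumptions stated above.

def requested_vector_length(len_field):
    """Vector length in bits requested by a 4-bit LEN value."""
    if not 0 <= len_field <= 0xF:
        raise ValueError("LEN is assumed to be a 4-bit field")
    return (len_field + 1) * 128

def effective_vector_length(len_fields, supported):
    """Largest supported length within every request in len_fields.

    len_fields: LEN values of the current and higher Exception levels.
    supported:  vector lengths (bits) the hypothetical implementation provides;
                assumed to include 128.
    """
    limit = min(requested_vector_length(f) for f in len_fields)
    return max(vl for vl in supported if vl <= limit)

if __name__ == "__main__":
    print(effective_vector_length([3, 1], [128, 256, 512]))
```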

### SVE2 assembly syntax

SVE2 follows the same assembly syntax format that SVE follows. The following instruction examples show this format.

Example 1:

`LDFF1D {<Zt>.D}, <Pg>/Z, [<Xn|SP>, <Zm>.D, LSL #3]`

Where:

• `Zt` is the vector register, one of `Z0-Z31`
• `D` is the element size (doubleword). Vector and predicate registers have a known element type but an unknown number of elements
• `Pg` is the predicate register, one of `P0-P15`
• `Z` selects zeroing predication
• `Zm` is the vector of offsets used for gather-scatter, or vector, addressing

Example 2:

`ADD <Zdn>.<T>, <Pg>/M, <Zdn>.<T>, <Zm>.<T>`

Where:

• `M` selects merging predication

Example 3:

`ORRS <Pd>.B, <Pg>/Z, <Pn>.B, <Pm>.B`

Where:

• `S` means that the instruction sets the condition flags `NZCV`, which SVE interprets in a new, predicate-based way
• `Pg`, a predicate, acts as a bit mask.

### SVE2 architecture features

SVE2 inherits the following important SVE architecture features:

• Gather-load and scatter-store

The flexible addressing modes in SVE allow a vector base address or a vector offset, which enables a single vector register to be loaded from non-contiguous memory locations. For example:

```
LD1SB  Z0.S, P0/Z, [Z1.S, #4]   // Gather load of signed bytes to active 32-bit elements of Z0 from memory addresses generated by 32-bit vector base Z1 plus immediate index #4.
LD1SB  Z0.D, P0/Z, [X0, Z1.D]   // Gather load of signed bytes to active elements of Z0 from memory addresses generated by a 64-bit scalar base X0 plus vector index in Z1.D.
```

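
The two addressing forms can be modeled in scalar Python to show what a gather load computes per lane. This is a sketch under simplifying assumptions: memory is a small `bytes` object, addresses are plain indexes into it, and all helper names are hypothetical.

```python
# Illustrative scalar model of a zeroing gather load of signed bytes.
# Not SVE code: memory is a bytes object and addresses are indexes into it.

def sign_extend_byte(b):
    """Interpret an unsigned byte value (0..255) as a signed byte."""
    return b - 256 if b >= 128 else b

def vector_base_plus_imm(z_base, imm):
    """Per-lane addresses for the [Zn, #imm] form."""
    return [base + imm for base in z_base]

def scalar_base_plus_vector(x_base, z_offsets):
    """Per-lane addresses for the [Xn, Zm] form."""
    return [x_base + off for off in z_offsets]

def gather_ld1sb(memory, addresses, pred):
    """Active lanes load a sign-extended byte; inactive lanes become zero."""
    return [sign_extend_byte(memory[a]) if active else 0
            for a, active in zip(addresses, pred)]

if __name__ == "__main__":
    memory = bytes([0x01, 0xFF, 0x7F, 0x80, 0x10, 0x20, 0x30, 0x40])
    addrs = vector_base_plus_imm([0, 3, 1, 2], 4)
    print(gather_ld1sb(memory, addrs, [True, True, False, True]))
```

Each lane reads from its own address, so one instruction can collect values that are scattered in memory.
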
• Per-lane predication

To allow operations on selected elements only, SVE and SVE2 introduce 16 predicate registers, `P0-P15`, which indicate the active lanes of the vectors that an operation applies to. For example:

`ADD Z0.D, P0/M, Z0.D, Z1.D  // Add the active elements of Z0 and Z1 and put the result in Z0. P0 indicates which elements of the operands are active and inactive. ‘M’ after P0 indicates merging predication: the inactive elements of Z0 keep the values they had before the ADD operation. On instructions that support zeroing predication, ‘Z’ after the predicate means that the inactive elements are zeroed in the destination vector register.`
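
The difference between merging and zeroing predication can be sketched as a scalar Python model. The function `predicated_add` and its `mode` parameter are hypothetical, and the model treats elements as 64-bit unsigned values.

```python
# Illustrative scalar model of a predicated, destructive vector add.
MASK64 = (1 << 64) - 1

def predicated_add(zdn, zm, pred, mode="M"):
    """Return the new value of the destination vector.

    mode "M": inactive lanes keep the old destination value (merging).
    mode "Z": inactive lanes are set to zero (zeroing).
    """
    result = []
    for old, other, active in zip(zdn, zm, pred):
        if active:
            result.append((old + other) & MASK64)
        elif mode == "M":
            result.append(old)
        else:
            result.append(0)
    return result

if __name__ == "__main__":
    z0, z1 = [1, 2, 3, 4], [10, 20, 30, 40]
    p0 = [True, False, True, False]
    print(predicated_add(z0, z1, p0, "M"))
    print(predicated_add(z0, z1, p0, "Z"))
```
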
• Predicate-driven loop control and management

Predicate-driven loop control and management is an efficient loop control feature. It removes the overhead of loop heads and tails, which is caused by processing partial vectors, by recording the indexes of the active and inactive elements in the predicate registers. This means that, in the next loop iteration, only the active elements perform the expected operations. For example:

`WHILELO P0.S, X8, X9  // Generate a predicate in P0 that, starting from the lowest numbered element, is true while the incrementing value of the first, unsigned scalar operand X8 is lower than the second scalar operand X9, and false thereafter, up to the highest numbered element.`
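
A loop driven by such a predicate needs no separate scalar tail. The Python sketch below models the idea with a hypothetical `whilelo` helper and a made-up `scaled_copy` loop. The lane count of 4 is an arbitrary assumption, and the model ignores the wrap-around behaviour of real unsigned registers.

```python
# Illustrative scalar model of predicate-driven loop control.

def whilelo(start, limit, lanes):
    """Lane i is active while start + i is lower than limit."""
    return [start + i < limit for i in range(lanes)]

def scaled_copy(src, factor, lanes=4):
    """Multiply every element by factor, one 'vector' of lanes at a time."""
    dst = [0] * len(src)
    i = 0
    pred = whilelo(i, len(src), lanes)
    while any(pred):                      # stop when no lane is active
        for lane, active in enumerate(pred):
            if active:                    # inactive lanes touch nothing
                dst[i + lane] = src[i + lane] * factor
        i += lanes
        pred = whilelo(i, len(src), lanes)
    return dst

if __name__ == "__main__":
    print(whilelo(8, 10, 4))
    print(scaled_copy(list(range(10)), 2))
```

With 10 elements and 4 lanes, the final iteration runs with only two active lanes, so the partial vector is handled by the same loop body.
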
• Vector partitioning for software-managed speculation

SVE relaxes the restrictions that Neon vectorization places on speculative loads. SVE introduces first-fault vector load instructions, for example `LDFF1D`, and the First Fault Register (FFR) to allow vector accesses to cross into invalid pages. For example:
```
LDFF1D Z0.D, P0/Z, [Z1.D, #0]  // Gather load with first-faulting behaviour of doublewords to active elements of Z0 from memory addresses generated by the vector base Z1 plus 0. Inactive elements do not read Device memory or signal faults, and are set to zero in the destination vector. A successful load from valid memory sets the corresponding element of the first-fault register (FFR) to true. The first faulting load sets the corresponding FFR element, and all the elements after it, to false.
RDFFR P0.B  // Read the first-fault register (FFR) and place it in the destination predicate without predication.
```

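
The interaction between a first-fault load and the FFR can be sketched in Python. This model goes slightly beyond the text above: it assumes that a fault on the first active element is taken as a normal fault, and that only later faults are suppressed and recorded in the FFR. Memory is a hypothetical dictionary of valid addresses, and lanes at or after a suppressed fault are simply left as zero.

```python
# Illustrative scalar model of a first-fault gather load and the FFR.

def ldff1(memory, addresses, pred, ffr):
    """Return (loaded values, updated FFR).

    memory: dict mapping valid addresses to values; a missing key is a fault.
    ffr:    list of booleans, typically all True before the load.
    """
    result = [0] * len(addresses)
    new_ffr = list(ffr)
    seen_active = False
    for lane, (addr, active) in enumerate(zip(addresses, pred)):
        if not active:
            continue                      # inactive lanes never access memory
        if addr not in memory:
            if not seen_active:
                # Assumption: the first active element faults normally.
                raise MemoryError(f"fault at address {addr}")
            for k in range(lane, len(new_ffr)):
                new_ffr[k] = False        # this lane and all later lanes
            break
        result[lane] = memory[addr]
        seen_active = True
    return result, new_ffr

if __name__ == "__main__":
    mem = {0: 10, 8: 20, 16: 30}
    print(ldff1(mem, [0, 8, 24, 16], [True] * 4, [True] * 4))
```

Software can then read the FFR to find out which lanes hold valid data and retry from the first lane that was not loaded.
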
• Extended floating-point and bitwise horizontal reductions

SVE enhances floating-point and bitwise horizontal reduction operations. Examples of these operations include in-order and tree-based floating-point sums. These operations trade off repeatability against performance. Here is some example code:

`FADDP  Z0.S, P0/M, Z0.S, Z1.S  // Add pairs of adjacent floating-point elements within each source vector Z0 and Z1, and interleave the results from corresponding lanes. The interleaved result values are destructively placed in the first source vector Z0.`
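
The trade-off between in-order and tree-based sums can be shown with a scalar Python sketch. The functions below are hypothetical models, not the exact element pairing that any SVE instruction uses. They illustrate why the two orderings can produce different floating-point results for the same data.

```python
# Illustrative model: in-order versus tree-based floating-point reduction.

def in_order_sum(values, initial=0.0):
    """Strictly ordered accumulation: repeatable, but one long dependency chain."""
    acc = initial
    for v in values:
        acc += v
    return acc

def tree_sum(values):
    """Pairwise reduction: fewer dependent steps, but a different rounding order."""
    level = list(values)
    if not level:
        return 0.0
    while len(level) > 1:
        if len(level) % 2:
            level.append(0.0)             # pad odd-length levels
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]

if __name__ == "__main__":
    data = [1e16, 1.0, -1e16, 1.0]
    print(in_order_sum(data), tree_sum(data))
```

For this input, the in-order sum keeps the final `1.0`, while the tree-based sum rounds both small terms away before the large terms cancel.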