This topic is a brief introduction to Scalable Vector Extension (SVE).
Scalable Vector Extension (SVE) is the next-generation SIMD extension of the Armv8-A A64 instruction set. It is not an extension of NEON, but a new set of vector instructions developed to target HPC workloads. SVE enables vectorization of loops which would either be impossible or not beneficial to vectorize with NEON.
Unlike other SIMD architectures, SVE can be Vector Length Agnostic (VLA): it does not fix the size of the vector registers, leaving hardware implementors free to choose the size best suited to the intended workloads.
The SVE instruction set introduces the following new architectural features for High Performance Computing (HPC):
- Scalable vector length
Vector-length-agnostic code lets each implementation choose its own vector length automatically, provided that the length is a multiple of 128 bits and does not exceed the architectural maximum of 2048 bits. SVE provides 32 scalable vector registers, named Z0-Z31.
- Per-lane predication
SVE provides 16 predicate registers, named p0-p15. Each predicate register is 1/8th of the size of a vector register (1 bit per byte), and is therefore also scalable in size. Predicate registers are written by condition-creating instructions, such as compares. Subsequent instructions can then use a predicate to control which elements (or ‘lanes’) of a vector are operated upon (the active elements).
- Gather-load and scatter-store
Allows data to be efficiently transferred to or from a vector of non-contiguous memory addresses. This enables a significantly wider range of source code constructs to be vectorized. To permit efficient accesses to contiguous memory data, SVE provides a rich set of load and store instructions which progress sequentially forwards through an array, supporting a full range of packed 8, 16, 32 and 64-bit vector element organizations.
- Vector partitioning
A vector partition is the dynamically determined portion of a vector defined by a predicate register. SVE permits the progression of a loop one partition at a time, until the whole vector has been processed or the loop has reached its natural conclusion.
- Fault-tolerant speculative vectorization
Suppresses memory faults that are not caused by the first active element of the vector, and instead generates a predicate value indicating which of the requested lanes were successfully loaded before the first memory fault. This allows loops with conditional exits or unknown trip counts to be safely vectorized, while maintaining the same faulting behavior as if they had been executed sequentially. A common use for fault-tolerant speculative vectorization is processing C strings, whose length is unknown until the terminating null byte is found.
- Horizontal vector operations
SVE has a family of horizontal reduction instructions which include integer and floating-point summation, minimum, maximum, and bitwise logical reductions.
- Serialized vector operations
SVE allows pointer chasing loops to be performed serially using a vector of addresses and an associated predicate, allowing the remainder of the loop to be parallelized.
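The scalable vector length and per-lane predication features listed above can be sketched with a small scalar model. The following is plain Python, not SVE code: the function names `whilelt` and `vla_add` are hypothetical (the first is loosely named after the SVE WHILELT instruction), and `lanes` stands in for the number of elements that one implementation's vector register holds. It is a minimal sketch of the idea that the same loop produces the same result whatever vector length the hardware provides.

```python
def whilelt(start, limit, lanes):
    """Model a 'while less-than' predicate: lane i is active if start + i < limit."""
    return [start + i < limit for i in range(lanes)]


def vla_add(a, b, lanes):
    """Add two equal-length lists, `lanes` elements per loop iteration.

    `lanes` models the implementation's vector length; the loop body
    never assumes a particular value for it.
    """
    out = [0] * len(a)
    i = 0
    while i < len(a):
        pred = whilelt(i, len(a), lanes)   # governing predicate for this iteration
        for lane, active in enumerate(pred):
            if active:                     # inactive lanes access no memory
                out[i + lane] = a[i + lane] + b[i + lane]
        i += lanes                         # advance by one vector length
    return out


if __name__ == "__main__":
    a = list(range(10))
    b = [10] * 10
    # Same source, two modelled vector lengths (4 lanes and 16 lanes).
    print(vla_add(a, b, 4))
    print(vla_add(a, b, 16))
```

Because the final, partial iteration is handled by the predicate rather than by a scalar tail loop, the model needs no clean-up code when the array length is not a multiple of `lanes`.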
The instruction set operates on a new set of vector and predicate registers:
- 32 Z registers, Z0-Z31
- 16 P registers, p0-p15
- 1 First Faulting Register (FFR)
Z registers are data registers. The architecture specifies that their size in bits must be a multiple of 128, from a minimum of 128 bits to an implementation-defined maximum of up to 2048 bits. Data in these registers can be interpreted as 8-bit bytes, 16-bit halfwords, 32-bit words, or 64-bit doublewords. For example, a 384-bit implementation of SVE can hold 48 bytes, 24 halfwords, 12 words, or 6 doublewords of data. The low 128 bits of each Z register overlap the corresponding NEON registers of the Advanced SIMD extension, and therefore also the scalar floating-point registers.
P registers are predicate registers, which are unique to SVE, and hold one bit for each byte available in a Z register. For example, an implementation providing 1024-bit Z registers provides 128-bit predicate registers.
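The register-size arithmetic above can be checked with a few lines of Python. This is only a sketch of the stated rules (a multiple of 128 bits, at most 2048 bits, one predicate bit per vector byte); the function names `z_register_lanes` and `p_register_bits` are hypothetical helpers, not part of any Arm tool or library.

```python
def check_vector_length(vl_bits):
    """Reject lengths that break the rules stated in the text."""
    if vl_bits % 128 != 0 or not 128 <= vl_bits <= 2048:
        raise ValueError("vector length must be a multiple of 128 bits, up to 2048")


def z_register_lanes(vl_bits):
    """Return the element count per Z register for each element size in bits."""
    check_vector_length(vl_bits)
    return {elem_bits: vl_bits // elem_bits for elem_bits in (8, 16, 32, 64)}


def p_register_bits(vl_bits):
    """Return the predicate register size: one bit per byte of a Z register."""
    check_vector_length(vl_bits)
    return vl_bits // 8


if __name__ == "__main__":
    print(z_register_lanes(384))   # bytes, halfwords, words, doublewords
    print(p_register_bits(1024))
```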
The FFR is a special predicate register. It differs from the regular predicate registers in that it is used implicitly by some dedicated instructions, called first-faulting loads.
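To make the role of the FFR more concrete, here is a simplified Python model of a first-faulting byte load and a string-length loop built on it. Everything in it is an assumption made for illustration: memory is a `bytes` object in which any address past the end "faults", `ldff1b` is a hypothetical function loosely named after the SVE LDFF1B instruction, and the returned `ffr` list only records which active lanes were loaded (the architectural FFR has more detailed update rules than this sketch).

```python
class MemoryFault(Exception):
    """Raised when the first active lane touches an invalid address."""


def ldff1b(memory, base, pred):
    """Model a first-faulting byte load.

    Returns (values, ffr). A fault on the first active lane is raised as
    usual; a fault on any later lane is suppressed, and that lane and all
    following lanes are reported as not loaded in `ffr`.
    """
    values = [0] * len(pred)
    ffr = [False] * len(pred)
    first_active = True
    for lane, active in enumerate(pred):
        if not active:
            continue
        addr = base + lane
        if addr >= len(memory):            # invalid address in this model
            if first_active:
                raise MemoryFault(addr)
            break                          # suppress the fault and stop loading
        values[lane] = memory[addr]
        ffr[lane] = True
        first_active = False
    return values, ffr


def vector_strlen(memory, start, lanes):
    """Find the length of a null-terminated string, `lanes` bytes at a time."""
    length = 0
    while True:
        values, ffr = ldff1b(memory, start + length, [True] * lanes)
        loaded = ffr.index(False) if False in ffr else lanes
        for lane in range(loaded):         # examine only lanes that were loaded
            if values[lane] == 0:
                return length + lane
        length += loaded                   # retry from the first lane not loaded


if __name__ == "__main__":
    print(vector_strlen(b"hello\0", 0, 16))
```

The loop always makes progress because the first active lane either loads or faults. If the string is not terminated before the end of memory, the fault is eventually raised on a first active lane, which matches what a sequential byte-by-byte loop would do.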
Individual predicate bits encode a boolean true or false, but a predicate lane, which contains between one and eight predicate bits, is either active or inactive, depending on the value of its least significant bit.
Similarly, in this document the terms active lane and inactive lane are used to qualify the lanes of data registers under the control of a predicate register.
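The relationship between predicate bits and predicate lanes can be sketched as follows. This is an illustrative Python model with a hypothetical function name, `active_lanes`; it assumes the predicate is given as a list with one entry per vector byte, lowest byte first, so that the least significant bit of each lane is the first entry of its group.

```python
def active_lanes(pred_bits, elem_bytes):
    """Return one boolean per lane for elements of `elem_bytes` bytes.

    Each lane owns `elem_bytes` consecutive predicate bits, but only the
    least significant one (the first of the group) decides whether the
    lane is active. The remaining bits are ignored.
    """
    if elem_bytes not in (1, 2, 4, 8):
        raise ValueError("element size must be 1, 2, 4, or 8 bytes")
    return [bool(pred_bits[i]) for i in range(0, len(pred_bits), elem_bytes)]


if __name__ == "__main__":
    # 16 predicate bits, as for a 128-bit vector.
    bits = [1, 0, 0, 0,  0, 1, 1, 1,  1, 1, 1, 1,  0, 0, 0, 0]
    print(active_lanes(bits, 4))   # four 32-bit lanes
    print(active_lanes(bits, 8))   # two 64-bit lanes
```

The second 32-bit lane is inactive even though three of its four bits are set, because its least significant bit is clear.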
The SVE assembly language is designed to closely mirror the AArch64 NEON mnemonics and operand syntax. However, SVE has significant differences which require extensions to the A64 assembly language, as follows:
- New register files for vectors and predicates
Adds the register names z0-z31 and p0-p15.
- Vector and predicate registers have unknown size
The element count is absent from an SVE vector or predicate shape suffix.
- A predicate is a “bit mask”
SVE-capable assemblers report any inconsistencies between size suffixes and other operands as an error.
- Zeroing or merging predication
Predicated instructions either zero the values of inactive lanes, ‘zeroing form’, or merge in the prior values, ‘merging form’. These instructions have a suffix that indicates which form is being used.
- Destructive encodings
Many instructions have destructive two-operand forms where the destination register also contains one of the source operands. To avoid ambiguity, the syntax uses a three-operand constructive notation, with the destructive operand being repeated in both the destination and source positions.
- Gather-scatter addressing
The A64 load/store address syntax is extended to allow vector operands within the address specifier.
- Predicate / vector condition codes
Adds a new set of aliases for condition codes for use in SVE assembler source and disassembly.
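The difference between zeroing and merging predication, described in the list above, can be shown with a short Python model. The function `predicated_add` and its `mode` argument are hypothetical and exist only for this sketch; in SVE assembly the form is selected by a `/z` or `/m` qualifier on the governing predicate, and not every instruction offers both forms.

```python
def predicated_add(dest, a, b, pred, mode):
    """Model a predicated vector add.

    Active lanes receive a + b. Inactive lanes are set to zero in
    'zeroing' mode, or keep the old contents of `dest` in 'merging' mode.
    """
    if mode not in ("zeroing", "merging"):
        raise ValueError("mode must be 'zeroing' or 'merging'")
    result = []
    for old, x, y, active in zip(dest, a, b, pred):
        if active:
            result.append(x + y)
        elif mode == "merging":
            result.append(old)     # inactive lane keeps its prior value
        else:
            result.append(0)       # inactive lane is cleared
    return result


if __name__ == "__main__":
    dest = [9, 9, 9, 9]
    a = [1, 2, 3, 4]
    b = [10, 20, 30, 40]
    pred = [True, False, True, False]
    print(predicated_add(dest, a, b, pred, "merging"))
    print(predicated_add(dest, a, b, pred, "zeroing"))
```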
SVE introduces a variety of instructions that operate on the data and predicate registers. There are two main classes of instructions: predicated instructions, which use a predicate register to control the lanes they operate on, and unpredicated instructions, which do not. In a predicated instruction, only the active lanes of vector operands are processed and can generate side effects, such as memory accesses and faults, or numeric exceptions.
Across these two main classes, there are data processing instructions that operate on Z registers (for example, addition), predicate generation instructions, such as numeric comparisons, that operate on data registers and produce predicate registers, and predicate manipulation instructions that mostly cover predicate generation or logical operations on predicates.
Only predicate registers p0 through p7 are usable as predicates in data-processing instructions.
Most data manipulation operations cover both the floating-point (FP) and integer domains. Some notable FP functionality is brought by the ordered horizontal reductions, which provide cross-lane operations that preserve the strict C/C++ rules on non-associativity of floating-point operations.
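The reason ordered reductions matter is that floating-point addition is not associative, so the order in which lanes are combined can change the result. The following Python sketch contrasts a strictly ordered sum over the active lanes (the behavior that an ordered reduction such as FADDA is meant to preserve) with a pairwise tree sum. Both function names are hypothetical, and the pairwise scheme is just one example of a reordered reduction, not a description of any particular hardware.

```python
def ordered_sum(values, pred, init=0.0):
    """Accumulate active lanes strictly from the lowest lane to the highest."""
    acc = init
    for value, active in zip(values, pred):
        if active:
            acc = acc + value
    return acc


def pairwise_sum(values):
    """Sum by recursively adding the two halves (a tree-shaped reduction)."""
    if not values:
        return 0.0
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    return pairwise_sum(values[:mid]) + pairwise_sum(values[mid:])


if __name__ == "__main__":
    values = [1e100, 1.0, -1e100, 1.0]
    print(ordered_sum(values, [True] * 4))   # same as a sequential C loop
    print(pairwise_sum(values))              # a different rounding order
```

With these inputs the ordered sum yields 1.0, while the pairwise sum yields 0.0, because each 1.0 is absorbed when it is added to a value of magnitude 1e100.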
A significant proportion of the new instruction set is dedicated to vector load/store instructions, which can perform signed or unsigned extension or truncation of the data, and come with a wide range of new addressing modes that improve the efficiency of SVE code.
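As a rough illustration of what extension on load and truncation on store mean, the Python model below widens narrow memory elements into wider lanes and narrows them again on the way out. The names `load_extend` and `store_truncate` are hypothetical, little-endian data is assumed, and Python integers stand in for vector lanes; it is a sketch of the data transformation only, not of SVE addressing modes.

```python
def load_extend(memory, offset, count, mem_bytes, signed):
    """Load `count` little-endian elements of `mem_bytes` bytes each.

    With signed=True the values are sign-extended into the wider lane;
    with signed=False they are zero-extended.
    """
    lanes = []
    for i in range(count):
        start = offset + i * mem_bytes
        chunk = memory[start:start + mem_bytes]
        lanes.append(int.from_bytes(chunk, "little", signed=signed))
    return lanes


def store_truncate(values, mem_bytes):
    """Store each lane as `mem_bytes` bytes, discarding the upper bits."""
    mask = (1 << (8 * mem_bytes)) - 1
    return b"".join((v & mask).to_bytes(mem_bytes, "little") for v in values)


if __name__ == "__main__":
    memory = bytes([0xFF, 0x01])
    print(load_extend(memory, 0, 2, 1, signed=True))    # sign-extended bytes
    print(load_extend(memory, 0, 2, 1, signed=False))   # zero-extended bytes
    print(store_truncate([0x1234, -1], 1))              # keep the low byte only
```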
SVE instructions can be separated by function into the following groups. For a more detailed description of the instructions, see the ARM Architecture Reference Manual Supplement - The Scalable Vector Extension (SVE), for ARMv8-A document.
- Load, store, and prefetch instructions
SVE vector load and store instructions transfer data in memory to or from elements of one or more vector or predicate registers. SVE also includes vector prefetch instructions that provide read and write hints to the memory system. Instructions include:
Predicated single vector contiguous element accesses
Predicated non-contiguous element accesses
Predicated multiple vector contiguous structure load/store
Predicated replicating element loads
Unpredicated vector register load/store
Unpredicated predicate register load/store
Unpredicated vector register load/store instructions do not perform endianness conversion, and should not be used in your code.
- Vector move operations
Vector move instructions copy data from scalar registers, immediate values, and other vectors to selected vector elements. Instructions include:
Element move and broadcast
- Integer operations
The integer instructions operate on signed or unsigned integer data within a vector. Instructions include:
Integer dot product
- Vector address calculation
The vector address calculation instructions compute vectors of addresses and addresses of vectors. This includes instructions to add a multiple of the current vector length or predicate register length, in bytes, to a general-purpose register.
- Bitwise operations
The bitwise instructions perform bitwise operations on vectors. Instructions include:
Bitwise shift, reverse, and count
- Floating-point operations
The floating-point instructions operate on floating-point data within a vector. Instructions include:
Floating-point multiply accumulate
Floating-point complex arithmetic
Floating-point rounding and conversion
Floating-point transcendental acceleration
Floating-point indexed multiplies
- Predicate operations
The predicate instructions relate to operations that manipulate the predicate registers. Instructions include:
Predicate move operations
Predicate logical operations
FFR predicate handling
- Move operations
These instructions move data between different vector elements, or between vector elements and scalar registers. Instructions include:
Element permute and shuffle
Index vector generation
- Reduction operations
Horizontal reduction instructions perform arithmetic horizontally across active elements of a single source vector and deliver a scalar result. Instructions include:
The aim of the Arm C language extensions (ACLE) is to make features of the Arm architecture directly available in C and C++ programs. The core ACLE is defined in a dedicated document, while the ACLE for Arm SVE document defines the part that is specific to the Arm Scalable Vector Extension (SVE).
General information on Arm C Language Extensions is available on the ACLE Developer web page.