What is the Scalable Vector Extension?
This topic is a short introduction to the Scalable Vector Extension (SVE).
Scalable Vector Extension (SVE) is the next-generation SIMD extension of the Arm®v8-A AArch64 instruction set. SVE is not an extension of Neon, but a new set of vector instructions developed to target HPC workloads. SVE enables vectorization of loops that would be either impossible or not beneficial to vectorize with Neon™.
Unlike other SIMD architectures, SVE can be Vector Length Agnostic (VLA). SVE does not fix the size of the vector registers, which allows hardware implementors to choose the size that is best for their workloads.
The SVE instruction set introduces the following new architectural features for High Performance Computing (HPC):
- Scalable vector length
Vector-length agnostic code allows each implementation to choose its own vector length, provided that the length is a multiple of 128 bits and does not exceed the architectural maximum of 2048 bits. SVE provides 32 scalable vector registers, named Z0-Z31.
- Per-lane predication
SVE provides 16 predicate registers, named p0-p15. Each predicate register is one-eighth of the size of a vector register (1 bit per byte), and is therefore also scalable in size. Predicate registers are written by condition-creating instructions, such as compares. Later instructions then use a predicate to control which elements (or 'lanes') of a vector they operate on (the 'active' elements).
- Gather-load and scatter-store
Gather-load and scatter-store allow data to be efficiently transferred to or from a vector of non-contiguous memory addresses. The efficient transfer of data enables a wider range of source code constructs to be vectorized. To permit efficient accesses to contiguous memory, SVE provides an extensive set of load and store instructions which progress sequentially forwards through an array, supporting a full range of packed 8, 16, 32, and 64-bit vector element organizations.
- Vector partitioning
A vector partition is the dynamically-determined portion of a vector defined by a predicate register. SVE permits the progression of a loop one partition at a time, until the whole vector has been processed or the loop has reached its natural conclusion.
- Fault-tolerant speculative vectorization
Fault-tolerant speculative vectorization suppresses memory faults if they do not occur because of the first active element of the vector. Instead, fault-tolerant speculative vectorization generates a predicate value indicating which of the requested lanes were successfully loaded prior to the first memory fault. This indication allows loops with conditional exits or unknown trip-counts to be safely vectorized, maintaining the same faulting behavior as if they had been executed sequentially. A common use for fault-tolerant speculative vectorization is in processing C strings, where the length is unknown until the terminating null character is found.
- Horizontal vector operations
SVE has a family of horizontal reduction instructions which include integer and floating-point summation, minimum, maximum, and bit-wise logical reductions.
- Serialized vector operations
SVE allows you to perform serial pointer-chasing loops and to use a vector of addresses with an associated predicate, allowing you to parallelize the remainder of the loop.
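The fault-tolerant speculative vectorization item above can be illustrated with a scalar C model of a string-length loop. This is only a sketch under stated assumptions: `ldff1_model` and `strlen_model` are hypothetical names, the `mem_len` bound stands in for the edge of mapped memory, and no SVE instruction is executed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a first-faulting byte load. `mem` has `mem_len`
 * readable bytes; any address at or beyond mem_len stands in for unmapped
 * memory. Returns the number of leading lanes that loaded (the information
 * the FFR-style predicate would carry), or -1 when the first lane itself
 * would fault. */
int ldff1_model(const uint8_t *mem, size_t mem_len, size_t addr,
                int lanes, uint8_t *out)
{
    if (addr >= mem_len)
        return -1;                  /* a fault on the first element is taken */
    int loaded = 0;
    while (loaded < lanes && addr + (size_t)loaded < mem_len) {
        out[loaded] = mem[addr + (size_t)loaded];
        loaded++;
    }
    return loaded;                  /* later faults are suppressed */
}

/* String length built on the model. `lanes` is the assumed number of byte
 * lanes per vector (1 to 256). Returns -1 for an unterminated string. */
long strlen_model(const uint8_t *mem, size_t mem_len, int lanes)
{
    uint8_t buf[256];               /* 2048-bit maximum = 256 byte lanes */
    size_t pos = 0;
    if (lanes < 1 || lanes > 256)
        return -1;
    for (;;) {
        int loaded = ldff1_model(mem, mem_len, pos, lanes, buf);
        if (loaded < 0)
            return -1;              /* the sequential loop would fault here too */
        for (int i = 0; i < loaded; i++)
            if (buf[i] == 0)
                return (long)(pos + (size_t)i);
        pos += (size_t)loaded;      /* resume at the first lane that did not load */
    }
}
```

The loop only ever faults where a byte-at-a-time loop would also have faulted, which is the property the feature description relies on.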
The instruction set operates on a new set of vector and predicate registers:
- 32 Z registers, z0-z31.
- 16 P registers, p0-p15.
- 1 First Faulting Register (FFR).
Z registers are data registers. The architecture specifies that their size in bits must be a multiple of 128, from a minimum of 128 bits to an implementation-defined maximum of up to 2048 bits. Data in these registers can be interpreted as 8-bit bytes, 16-bit halfwords, 32-bit words or 64-bit doublewords. For example, a 384-bit implementation of SVE can hold 48 bytes, 24 halfwords, 12 words, or 6 doublewords of data. The low 128 bits of each Z register overlap the corresponding Neon registers of the Advanced SIMD extension, and therefore also the scalar floating-point registers.
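The sizing rule for Z registers reduces to simple arithmetic. The following C sketch reproduces it; the function names `sve_vl_is_valid` and `sve_lanes` are made up for illustration and are not part of any Arm library.

```c
#include <stdbool.h>

/* True when vl_bits satisfies the rule stated above: a multiple of
 * 128 bits, from 128 bits up to 2048 bits. */
bool sve_vl_is_valid(unsigned vl_bits)
{
    return vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0;
}

/* Number of elements of elem_bits (8, 16, 32 or 64) that fit in one
 * Z register of vl_bits. Returns 0 for an invalid combination. */
unsigned sve_lanes(unsigned vl_bits, unsigned elem_bits)
{
    if (!sve_vl_is_valid(vl_bits))
        return 0;
    if (elem_bits != 8 && elem_bits != 16 && elem_bits != 32 && elem_bits != 64)
        return 0;
    return vl_bits / elem_bits;
}
```

For the 384-bit example in the text, `sve_lanes(384, 8)` gives 48 and `sve_lanes(384, 64)` gives 6.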
P registers are 'predicate' registers, which are unique to SVE, and hold one bit for each byte available in a Z register. For example, an implementation providing 1024-bit Z registers provides 128-bit predicate registers.
The FFR register is a 'special' predicate register that differs from regular predicate registers because it is used implicitly by some dedicated instructions, called first faulting loads.
Individual predicate bits encode a Boolean true or false, but a predicate lane, which contains between one and eight predicate bits, is either 'active' or 'inactive', depending on the value of its least significant bit.
Similarly, in this document the terms 'active' or 'inactive' lane are used to qualify the lanes of data registers under the control of a predicate register.
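The relationship between predicate bits and lanes can be modelled in a few lines of C. This is a sketch, not SVE code: the function names are hypothetical, and holding the predicate in a `uint64_t` is an assumption that limits the model to vectors of up to 512 bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Predicate register size in bits: one bit per byte of the Z register. */
unsigned sve_predicate_bits(unsigned vl_bits)
{
    return vl_bits / 8;
}

/* Model of lane activity. Bit k of `pred` corresponds to byte k of the
 * vector. A lane of elem_bytes bytes (1, 2, 4 or 8) owns elem_bytes
 * consecutive predicate bits, but only the least significant of them
 * decides whether the lane is active. The caller must keep
 * lane * elem_bytes below 64. */
bool sve_lane_is_active(uint64_t pred, unsigned lane, unsigned elem_bytes)
{
    return (pred >> (lane * elem_bytes)) & 1u;
}
```

With 32-bit lanes (`elem_bytes` of 4), a predicate value of 0x11 makes lanes 0 and 1 active, whereas 0xE leaves lane 0 inactive even though three of its four bits are set.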
The SVE assembly language is designed to closely mirror the AArch64 Neon™ mnemonics and operand syntax. However, SVE has significant differences which require extensions to the A64 assembly language:
- New register files for vectors and predicates
Adds the register names z0-z31 and p0-p15.
- Vector and predicate registers have unknown size
The element count is absent from an SVE vector or predicate shape suffix.
- A predicate is a "bit mask"
SVE-capable assemblers report any inconsistencies between size suffixes and other operands as an error.
- Zeroing or merging predication
Predicated instructions either zero the values of inactive lanes, 'zeroing form', or merge in the prior values, 'merging form'. These instructions have a suffix that indicates which form is being used.
- Destructive encodings
Many instructions have destructive two-operand forms where the destination register also contains one of the source operands. To avoid ambiguity, the syntax uses a three-operand constructive notation, with the destructive operand being repeated in both the destination and source positions.
- Gather-scatter addressing
The A64 load/store address syntax is extended to allow vector operands within the address specifier.
- Predicate / vector condition codes
Adds a new set of aliases for condition codes for use in SVE assembler source and disassembly.
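The difference between merging and zeroing predication, and the way a destructive operand doubles as source and destination, can be sketched with a scalar C model. The function names are hypothetical and the code models the semantics only; it does not correspond to a specific encoding.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Merging form: inactive lanes keep the prior value of the destination.
 * dst is also the first source, as in a destructive encoding such as
 * ADD Z0.S, P0/M, Z0.S, Z1.S, where Z0 appears in both positions. */
void add_merging(int32_t *dst, const int32_t *src,
                 const bool *pred, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        if (pred[i])
            dst[i] += src[i];
}

/* Zeroing form: inactive lanes of the destination are set to zero. */
void add_zeroing(int32_t *dst, const int32_t *a, const int32_t *b,
                 const bool *pred, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++)
        dst[i] = pred[i] ? a[i] + b[i] : 0;
}
```

In assembly source, the form is selected by the qualifier on the governing predicate: `/M` for merging and `/Z` for zeroing.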
SVE instruction set
SVE introduces various instructions that operate on the data and predicate registers. There are two main classes of instructions: 'predicated' instructions, which use a predicate register to control the lanes they operate on, and 'unpredicated' instructions, which do not. In a predicated instruction, only the active lanes of vector operands are processed and can generate side effects, such as memory accesses and faults, or numeric exceptions.
Across these two main classes, there are:
- 'data processing' instructions that operate on Z registers (for example, addition).
- 'predicate generation' instructions that operate on data registers and produce predicate registers (for example, numeric comparisons).
- 'predicate manipulation' instructions, that include predicate generation or logical operations on predicates.
You can only use the predicate registers p0-p7 as predicates in data-processing instructions.
Most data manipulation operations cover both floating-point (FP) and integer domains, with some notable FP functionality that is brought by the ordered horizontal reductions. Ordered horizontal reductions provide cross-lane operations that preserve the strict C/C++ rules on non-associativity of floating-point operations.
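Why ordering matters can be seen with a small scalar C comparison of a strictly ordered sum against a pairwise (tree) sum. This is an illustrative sketch with hypothetical function names; it models the two summation orders and does not execute any SVE reduction instruction. In ACLE, the corresponding intrinsics are `svadda` (strictly ordered) and `svaddv` (tree-based).

```c
#include <stddef.h>

/* Strictly ordered reduction: elements are added left to right, exactly
 * as a sequential C loop would add them. */
double reduce_ordered(const double *v, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += v[i];
    return acc;
}

/* Pairwise (tree) reduction: the two halves are summed independently and
 * then combined. Same values, different association. */
double reduce_pairwise(const double *v, size_t n)
{
    if (n == 0)
        return 0.0;
    if (n == 1)
        return v[0];
    size_t half = n / 2;
    return reduce_pairwise(v, half) + reduce_pairwise(v + half, n - half);
}
```

For the input {1e100, 1.0, -1e100, 1.0}, the ordered sum is 1.0 while the pairwise sum is 0.0, because each 1.0 is absorbed when it is added to a value of magnitude 1e100. A compiler that must preserve the C/C++ result can therefore only vectorize such a loop with an ordered reduction.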
A large proportion of the new instruction set is dedicated to vector load/store instructions, which can perform 'signed' or 'unsigned' 'extension' or 'truncation' of the data. Vector load/store instructions also have a wide range of new addressing modes that improve the efficiency of SVE code.
For a more detailed description of the instructions, see the ARM Architecture Reference Manual Supplement - The Scalable Vector Extension (SVE), for ARMv8-A document.
SVE instructions can be separated by function:
- Load, store, and prefetch instructions
SVE vector load and store instructions transfer data in memory to, or from, elements of one or more vector or predicate registers. SVE also includes vector prefetch instructions that provide read and write hints to the memory system. Instructions include:
Predicated single vector contiguous element accesses.
Predicated non-contiguous element accesses.
Predicated multiple vector contiguous structure load/store.
Predicated replicating element loads.
Unpredicated vector register load/store.
Unpredicated predicate register load/store.
Unpredicated vector register load/store instructions do not perform endianness conversion, and should not be used in your code.
- Vector move operations
Vector move instructions copy data from scalar registers, immediate values, and other vectors to selected vector elements. Instructions include:
Element move and broadcast.
- Integer operations
The integer instructions operate on signed or unsigned integer data within a vector. Instructions include:
Integer dot product.
- Vector address calculation
The vector address calculation instructions compute vectors of addresses and addresses of vectors. This includes instructions to add a multiple of the current vector length or predicate register length, in bytes, to a general-purpose register.
- Bitwise operations
The bitwise instructions perform bitwise operations on vectors. Instructions include:
Bitwise shift, reverse, and count.
- Floating-point operations
The floating-point instructions operate on floating-point data within a vector. Instructions include:
Floating-point multiply accumulate.
Floating-point complex arithmetic.
Floating-point rounding and conversion.
Floating-point transcendental acceleration.
Floating-point indexed multiplies.
- Predicate operations
The predicate instructions relate to operations that manipulate the predicate registers. Instructions include:
Predicate move operations.
Predicate logical operations.
FFR predicate handling.
- Move operations
These instructions move data between different vector elements, or between vector elements and scalar registers. Instructions include:
Element permute and shuffle.
Index vector generation.
- Reduction operations
Horizontal reduction instructions perform arithmetic horizontally across active elements of a single source vector and deliver a scalar result. They include integer and floating-point summation, minimum, maximum, and bitwise logical reductions.
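The vector address calculation item in the list above mentions adding a multiple of the vector length or predicate register length, in bytes, to a general-purpose register (the A64 instructions ADDVL and ADDPL). The arithmetic can be sketched in C as follows. The function names are hypothetical, and `vl_bits` is passed in explicitly because this model does not query real hardware.

```c
#include <stdint.h>

/* Vector length in bytes for an implementation with vl_bits-bit Z registers. */
uint64_t vl_bytes(unsigned vl_bits)
{
    return vl_bits / 8u;
}

/* Predicate length in bytes: one predicate bit per vector byte. */
uint64_t pl_bytes(unsigned vl_bits)
{
    return vl_bits / 64u;
}

/* Model of adding a signed multiple of the vector length to an address. */
uint64_t add_vl_multiple(uint64_t base, int64_t multiple, unsigned vl_bits)
{
    return base + (uint64_t)(multiple * (int64_t)vl_bytes(vl_bits));
}

/* Model of adding a signed multiple of the predicate length to an address. */
uint64_t add_pl_multiple(uint64_t base, int64_t multiple, unsigned vl_bits)
{
    return base + (uint64_t)(multiple * (int64_t)pl_bytes(vl_bits));
}
```

Because the step is expressed in units of the vector length, the same code advances a pointer by 32 bytes on a 128-bit implementation and by 64 bytes on a 256-bit one when the multiple is 2.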
Arm C Language Extensions (ACLE) for Arm SVE
The aim of the Arm C language extensions (ACLE) is to make features of the Arm architecture directly available in C and C++ programs. The core ACLE is defined in a dedicated document, while the ACLE for Arm SVE document defines the part that is specific to the Arm Scalable Vector Extension (SVE).
General information on Arm C Language Extensions is available on the ACLE Developer web page.
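As an illustration of what the ACLE for SVE makes possible, the following sketch writes a vector-length agnostic `y[i] += a * x[i]` loop with SVE intrinsics from `arm_sve.h`. The function name `daxpy` and the scalar fallback branch are choices made for this example, so that the code still builds on a compiler or target without SVE; treat it as a minimal sketch rather than a tuned implementation.

```c
#include <stdint.h>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>

/* y[i] += a * x[i], written once for any SVE vector length. */
void daxpy(int64_t n, double a, const double *x, double *y)
{
    /* svcntd() is the number of 64-bit lanes in one vector. */
    for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {
        /* Lane k is active while i + k < n, so the tail needs no
         * separate scalar loop. */
        svbool_t pg = svwhilelt_b64_s64(i, n);
        svfloat64_t vx = svld1_f64(pg, &x[i]);
        svfloat64_t vy = svld1_f64(pg, &y[i]);
        vy = svmla_n_f64_x(pg, vy, vx, a);   /* vy + vx * a on active lanes */
        svst1_f64(pg, &y[i], vy);
    }
}
#else
/* Scalar fallback (an addition for this example) for builds without SVE. */
void daxpy(int64_t n, double a, const double *x, double *y)
{
    for (int64_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
#endif
```

With GCC or Clang, the SVE branch is selected when the target enables the extension, for example with `-march=armv8-a+sve`. The source contains no vector length, which is the point of the VLA model described at the start of this topic.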