Scalable Vector Extension (SVE) is the next-generation SIMD instruction set for Armv8-A (AArch64). Unlike other SIMD architectures, SVE does not fix the size of the vector registers. Instead, it constrains the size to a range of possible values, from a minimum of 128 bits up to a maximum of 2048 bits, in 128-bit increments. A CPU vendor can therefore implement the extension with the vector register size that best suits the workloads the CPU is targeting. The design of SVE guarantees that the same program can run on different implementations of the instruction set architecture without being recompiled.
Note: SVE is not an extension of NEON. SVE is a new set of vector instructions developed to target HPC workloads.
The SVE instruction set introduces the following new architectural features for High Performance Computing (HPC):
Scalable vector length
Each implementation chooses its own vector length, provided that it is a multiple of 128 bits and does not exceed the architectural maximum of 2048 bits. Vector code that does not assume a particular length adapts automatically to the implementation it runs on. SVE provides 32 scalable vector registers, named Z0-Z31.
SVE provides 16 predicate registers, named p0-p15, with each predicate register being 1/8th of the size of the vector register (1 bit per byte), and therefore scalable in size. Predicate registers are written to by condition-creating instructions, such as compares. This allows subsequent instructions to control which elements (or ‘lanes’) of a vector should be operated upon (the active elements).
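To make the interplay of scalable vector length and predication concrete, the following is a minimal scalar sketch in portable C. It is a model, not SVE code: `model_whilelt`, `model_vla_add`, and `MAX_LANES` are hypothetical names introduced here for illustration, and the `lanes` parameter stands in for the hardware vector length. The point it illustrates is that a loop governed by a predicate produces the same result for any vector length, with no separate scalar tail loop.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_LANES 64 /* 2048 bits / 32-bit elements; lanes must not exceed this */

/* Hypothetical model of a compare-generated predicate:
   lane i is active while base + i < n. */
static void model_whilelt(uint8_t pred[], size_t lanes, size_t base, size_t n) {
    for (size_t i = 0; i < lanes; i++)
        pred[i] = (base + i < n) ? 1 : 0;
}

/* Adds a[] and b[] into out[] in steps of `lanes` 32-bit elements.
   `lanes` models the implementation's vector length (4 for 128 bits). */
static void model_vla_add(const int32_t *a, const int32_t *b, int32_t *out,
                          size_t n, size_t lanes) {
    uint8_t pred[MAX_LANES];
    for (size_t base = 0; base < n; base += lanes) {
        model_whilelt(pred, lanes, base, n);
        for (size_t i = 0; i < lanes; i++) {
            if (pred[i]) /* inactive lanes access no memory */
                out[base + i] = a[base + i] + b[base + i];
        }
    }
}
```

Calling `model_vla_add` on the same ten-element arrays with `lanes` set to 4 and then to 16 fills `out` identically, which is the property that lets one binary run on implementations with different vector lengths.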
Gather-load and scatter-store
Allows data to be efficiently transferred to or from a vector of non-contiguous memory addresses, with or without predication. This enables a significantly wider range of source code constructs to be vectorized. To permit efficient accesses to contiguous memory data, SVE provides a rich set of load and store instructions which progress sequentially forwards through an array, supporting a full range of packed 8, 16, 32 and 64-bit vector element organizations.
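The effect of a predicated gather-load and scatter-store can be sketched with a scalar model in portable C. The function names `model_gather` and `model_scatter` are hypothetical, and the sketch uses element indices into one array rather than a vector of full addresses. It assumes that inactive lanes of a load are set to zero and that inactive lanes of a store leave memory untouched.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a predicated gather: dst[i] = base[idx[i]] for active lanes.
   Inactive lanes are zeroed and their addresses are never accessed. */
static void model_gather(int32_t *dst, const int32_t *base,
                         const uint32_t *idx, const uint8_t *pred,
                         size_t lanes) {
    for (size_t i = 0; i < lanes; i++)
        dst[i] = pred[i] ? base[idx[i]] : 0;
}

/* Model of a predicated scatter: base[idx[i]] = src[i] for active lanes only. */
static void model_scatter(int32_t *base, const int32_t *src,
                          const uint32_t *idx, const uint8_t *pred,
                          size_t lanes) {
    for (size_t i = 0; i < lanes; i++) {
        if (pred[i])
            base[idx[i]] = src[i];
    }
}
```

An indirect access such as `out[i] = table[index[i]]` in a source loop corresponds to one `model_gather` step per vector of indices, which is the kind of construct that becomes vectorizable once gathers are available.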
A vector partition is the dynamically determined portion of a vector defined by a predicate register. SVE permits the progression of a loop one partition at a time, until the whole vector has been processed or the loop has reached its natural conclusion.
Fault-tolerant speculative vectorization
Suppresses memory faults that are not caused by the first active element of the vector, and instead generates a predicate value indicating which of the requested lanes were successfully loaded before the first memory fault. This allows loops with conditional exits or unknown trip counts to be vectorized safely, with the same faulting behavior as if they had been executed sequentially.
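The behavior described above can be sketched with a scalar model in portable C. This is a simplified illustration, not SVE code: `model_ldff1`, `model_strlen`, and `MAX_LANES` are hypothetical names, a "fault" is modeled as an access at or beyond `mem_size`, and the caller is assumed to set every `ffr` entry to 1 before each load.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_LANES 256 /* 2048 bits / 8-bit elements; lanes must not exceed this */

/* Model of a first-faulting byte load. Returns -1 if the first active lane
   would fault (the fault is taken), otherwise 0. On a later fault, ffr[] is
   cleared from the faulting lane onward and loading stops. */
static int model_ldff1(uint8_t *dst, uint8_t *ffr, const uint8_t *mem,
                       size_t mem_size, size_t addr, const uint8_t *pred,
                       size_t lanes) {
    int seen_active = 0;
    for (size_t i = 0; i < lanes; i++)
        dst[i] = 0;
    for (size_t i = 0; i < lanes; i++) {
        if (!pred[i])
            continue;
        if (addr + i >= mem_size) {
            if (!seen_active)
                return -1;
            for (size_t j = i; j < lanes; j++)
                ffr[j] = 0;
            return 0;
        }
        seen_active = 1;
        dst[i] = mem[addr + i];
    }
    return 0;
}

/* strlen-style loop with an unknown trip count. Returns the length, or -1
   where sequential code would also have faulted. */
static long model_strlen(const uint8_t *mem, size_t mem_size, size_t start,
                         size_t lanes) {
    uint8_t dst[MAX_LANES], ffr[MAX_LANES], pred[MAX_LANES];
    size_t pos = start;
    for (;;) {
        for (size_t i = 0; i < lanes; i++) {
            pred[i] = 1;
            ffr[i] = 1;
        }
        if (model_ldff1(dst, ffr, mem, mem_size, pos, pred, lanes) != 0)
            return -1;
        size_t i = 0;
        for (; i < lanes && ffr[i]; i++) {
            if (dst[i] == 0)
                return (long)(pos + i - start);
        }
        pos += i; /* advance only past lanes that were really loaded */
    }
}
```

The loop reads a whole vector speculatively, but only lanes flagged valid in `ffr` are inspected, so a terminator found before the inaccessible region ends the loop without a fault, whatever the value of `lanes`.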
Horizontal and serialized vector operations
SVE has a family of horizontal reduction instructions which include integer and floating-point summation, minimum, maximum, and bitwise logical reductions. SVE allows pointer chasing loops to be performed serially using a vector of addresses and an associated predicate, allowing the remainder of the loop to be parallelized.
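As an illustration of a horizontal reduction over active elements, here is a scalar sketch in portable C. The names `model_addv` and `model_smaxv` are hypothetical, and returning `INT32_MIN` when no lane is active is a choice made for this model.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of an integer sum reduction: adds the active lanes of v[] and
   returns one scalar. Inactive lanes contribute nothing. */
static int64_t model_addv(const int32_t *v, const uint8_t *pred, size_t lanes) {
    int64_t sum = 0;
    for (size_t i = 0; i < lanes; i++) {
        if (pred[i])
            sum += v[i];
    }
    return sum;
}

/* Model of a signed maximum reduction over active lanes.
   Returns INT32_MIN if no lane is active (a choice made for this sketch). */
static int32_t model_smaxv(const int32_t *v, const uint8_t *pred, size_t lanes) {
    int32_t best = INT32_MIN;
    for (size_t i = 0; i < lanes; i++) {
        if (pred[i] && v[i] > best)
            best = v[i];
    }
    return best;
}
```

For the lanes `{3, -7, 10, 4}` with the third lane inactive, the sum is 0 and the maximum is 4, because the inactive value 10 takes no part in either result.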
The instruction set operates on a set of vector and predicate registers, as described in the following table:

| Register | Description |
| --- | --- |
| z0 to z31 | Each register is a minimum of 128 bits wide, up to an implementation-defined maximum of 2048 bits. Data in these registers can be interpreted as 8-bit bytes, 16-bit halfwords, 32-bit words or 64-bit doublewords. For example, a 384-bit implementation of SVE can hold 48 bytes, 24 halfwords, 12 words, or 6 doublewords of data. |
| p0 to p15 | Holds one bit for each byte available in a Z register. For example, an implementation providing 1024-bit Z registers provides 128-bit predicate registers. |
| Special use predicate register | Used implicitly by some dedicated instructions, called first-faulting loads. |
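The register sizes in the table follow from simple arithmetic, sketched here in portable C. The function names are hypothetical helpers introduced for illustration, and the validity check encodes the constraints stated in this document (a multiple of 128 bits, between 128 and 2048 bits).

```c
/* Returns 1 if vl_bits is a permitted vector length as described here:
   a multiple of 128 bits, from 128 up to 2048 bits. */
static int model_vl_is_valid(unsigned vl_bits) {
    return vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0;
}

/* Number of elements of elem_bits (8, 16, 32 or 64) held in one Z register. */
static unsigned model_lanes(unsigned vl_bits, unsigned elem_bits) {
    return vl_bits / elem_bits;
}

/* Size of a predicate register in bits: one bit per byte of a Z register. */
static unsigned model_pred_bits(unsigned vl_bits) {
    return vl_bits / 8;
}
```

For a 384-bit implementation, `model_lanes` gives 48, 24, 12 and 6 for element sizes of 8, 16, 32 and 64 bits, and `model_pred_bits(1024)` gives 128, matching the examples in the table.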
The SVE assembly language is designed to closely mirror the AArch64 Advanced SIMD mnemonics and operand syntax. However, SVE has significant differences from Advanced SIMD which require extensions to the A64 assembly language, as follows:
New register files for vectors and predicates
Adds the register names z0-z31 and p0-p15.
Vector and predicate registers have unknown size
The element count is absent from an SVE vector or predicate shape suffix.
A predicate is a “bit mask”
SVE-capable assemblers will report any inconsistencies between size suffixes and other operands as an error.
Zeroing or merging predication
Predicated instructions either zero the values of inactive lanes (the ‘zeroing form’) or merge in the prior values (the ‘merging form’). Where an instruction supports both forms, its governing predicate has a suffix that indicates which form is being used: /Z for zeroing or /M for merging.
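The difference between the two forms can be sketched with a scalar model in portable C. The functions `model_add_zeroing` and `model_add_merging` are hypothetical and use an element-wise add purely as an example operation; they are not a statement about which forms any particular SVE instruction supports.

```c
#include <stddef.h>
#include <stdint.h>

/* Zeroing form: inactive lanes of dst are set to zero. */
static void model_add_zeroing(int32_t *dst, const int32_t *a, const int32_t *b,
                              const uint8_t *pred, size_t lanes) {
    for (size_t i = 0; i < lanes; i++)
        dst[i] = pred[i] ? a[i] + b[i] : 0;
}

/* Merging form: inactive lanes of dst keep their prior values. */
static void model_add_merging(int32_t *dst, const int32_t *a, const int32_t *b,
                              const uint8_t *pred, size_t lanes) {
    for (size_t i = 0; i < lanes; i++) {
        if (pred[i])
            dst[i] = a[i] + b[i];
    }
}
```

With a destination that initially holds `{100, 200, 300, 400}`, inputs `{1, 2, 3, 4}` and `{10, 20, 30, 40}`, and only lanes 0 and 2 active, the zeroing form yields `{11, 0, 33, 0}` and the merging form yields `{11, 200, 33, 400}`.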
Many instructions have destructive two-operand forms where the destination register also contains one of the source operands. To avoid ambiguity, the syntax uses a three-operand constructive notation, with the destructive operand being repeated in both the destination and source positions.
The A64 load/store address syntax is extended to allow vector operands within the address specifier.
Predicate / vector condition codes
Adds a new set of aliases for condition codes for use in SVE assembler source and disassembly.
SVE instruction set
The SVE instructions can be separated by function into the following groups. For a more detailed description of the instructions, see the ARM Architecture Reference Manual Supplement - The Scalable Vector Extension (SVE), for ARMv8-A document.
Load, store, and prefetch instructions
SVE vector load and store instructions transfer data in memory to or from elements of one or more vector or predicate transfer registers. SVE also includes vector prefetch instructions that provide read and write hints to the memory system. Instructions include:
- Predicated single vector contiguous element accesses
- Predicated non-contiguous element accesses
- Predicated multiple vector contiguous structure load/store
- Predicated replicating element loads
- Unpredicated vector register load/store
- Unpredicated predicate register load/store
Vector move operations
Vector move instructions copy data from scalar registers, immediate values, and other vectors to selected vector elements. Instructions include:
- Element move and broadcast
The integer instructions operate on signed or unsigned integer data within a vector. Instructions include:
- Integer arithmetic
- Integer dot product
- Integer comparisons
Vector address calculation
The vector address calculation instructions compute vectors of addresses and addresses of vectors. This includes instructions to add a multiple of the current vector length or predicate register length, in bytes, to a general-purpose register.
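The arithmetic behind adding a multiple of the vector or predicate register length to a general-purpose register can be sketched in portable C. The function names `model_add_vl` and `model_add_pl` are hypothetical, the vector length is passed in explicitly as `vl_bits`, and the sketch places no limit on the multiplier.

```c
#include <stdint.h>

/* Adds imm multiples of the vector register length, in bytes, to xn.
   A vector register of vl_bits bits occupies vl_bits / 8 bytes. */
static uint64_t model_add_vl(uint64_t xn, int imm, unsigned vl_bits) {
    int64_t bytes = (int64_t)imm * (int64_t)(vl_bits / 8);
    return xn + (uint64_t)bytes;
}

/* Adds imm multiples of the predicate register length, in bytes, to xn.
   A predicate register has one bit per vector byte: vl_bits / 64 bytes. */
static uint64_t model_add_pl(uint64_t xn, int imm, unsigned vl_bits) {
    int64_t bytes = (int64_t)imm * (int64_t)(vl_bits / 64);
    return xn + (uint64_t)bytes;
}
```

On a 256-bit implementation, stepping a pointer by two vectors adds 64 bytes, while stepping by two predicates adds 8 bytes. Because the step is expressed in register lengths rather than a fixed byte count, the same code advances correctly on any implementation.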
The bitwise instructions perform bitwise operations on vectors. Instructions include:
- Bitwise logical operations
- Bitwise shift, reverse, and count
The floating-point instructions operate on floating-point data within a vector. Instructions include:
- Floating-point arithmetic
- Floating-point multiply accumulate
- Floating-point complex arithmetic
- Floating-point rounding and conversion
- Floating-point comparisons
- Floating-point transcendental acceleration
- Floating-point indexed multiplies
The predicate instructions relate to operations that manipulate the predicate registers. Instructions include:
- Predicate initialization
- Predicate move operations
- Predicate logical operations
- FFR predicate handling
- Predicate counts
- Loop control
- Serialized operations
The data movement instructions move data between different vector elements, or between vector elements and scalar registers. Instructions include:
- Element permute and shuffle
- Unpacking instructions
- Predicate permute
- Index vector generation
- Move prefix
Horizontal reduction instructions perform arithmetic horizontally across active elements of a single source vector and deliver a scalar result. Instructions include:
- Horizontal reductions
For a more detailed description of SVE, why it is useful for HPC, or to see some example code, see the following additional resources on SVE:
- Why SVE for HPC?
A short introduction as to why SVE is beneficial for HPC.
- White Paper: A sneak peek into SVE and VLA programming
An overview of SVE with information on the new registers, the new instructions, and the Vector Length Agnostic (VLA) programming technique, with some examples.
- White Paper: Arm Scalable Vector Extension and application to Machine Learning
This white paper presents code examples that show how to vectorize some of the core computational kernels that are part of a machine learning system. These examples are written with the Vector Length Agnostic (VLA) approach introduced by the Scalable Vector Extension (SVE).
- ARM C Language Extensions for SVE
The SVE ACLE defines a set of C and C++ types and accessors for SVE vectors and predicates.
- DWARF for the ARM® 64-bit Architecture (AArch64) with SVE support
This document describes the use of the DWARF debug table format in the Application Binary Interface (ABI) for the Arm 64-bit architecture.
- Procedure Call Standard for the ARM 64-bit Architecture (AArch64) with SVE support
This document describes the Procedure Call Standard used by the Application Binary Interface (ABI) for the Arm 64-bit architecture.
- ARM Architecture Reference Manual Supplement - The Scalable Vector Extension (SVE), for ARMv8-A
This supplement describes the Scalable Vector Extension to the Armv8-A architecture profile.