What is Scalable Vector Extension?

This topic is a brief introduction to Scalable Vector Extension (SVE). 

Introduction

Scalable Vector Extension (SVE) is the next-generation SIMD instruction set for Armv8-A (AArch64). Unlike other SIMD architectures, SVE does not fix the size of the vector registers. Instead, it constrains the size to a range of possible values, from a minimum of 128 bits up to a maximum of 2048 bits, in 128-bit increments. Any CPU vendor can therefore implement the extension with the vector register size that best suits the workloads the CPU is targeting. The design of SVE guarantees that the same program can run on different implementations of the instruction set architecture without the need to recompile the code.

Note: SVE is not an extension of NEON. SVE is a new set of vector instructions developed to target HPC workloads.

The SVE instruction set introduces the following new architectural features for High Performance Computing (HPC):

Scalable vector length

Each implementation chooses its own vector length, which must be a multiple of 128 bits and must not exceed the architectural maximum of 2048 bits. Vector code does not hard-code that length, so it adapts automatically to whichever length the implementation provides. SVE provides 32 scalable vector registers, named z0-z31.
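The length constraint above can be written down as a small check. This is a minimal sketch in Python, not part of any Arm tool; the function and constant names are hypothetical and chosen only for illustration.

```python
# Illustrative model of the SVE vector-length rule described above.
# All names here are hypothetical; they do not come from an Arm library.
MIN_VL_BITS = 128      # architectural minimum vector length
MAX_VL_BITS = 2048     # architectural maximum vector length
GRANULE_BITS = 128     # lengths are multiples of this unit


def is_legal_vector_length(vl_bits):
    """Return True if vl_bits is a multiple of 128 in the range 128..2048."""
    in_range = MIN_VL_BITS <= vl_bits <= MAX_VL_BITS
    return in_range and vl_bits % GRANULE_BITS == 0


def legal_vector_lengths():
    """List every vector length, in bits, that satisfies the rule."""
    return list(range(MIN_VL_BITS, MAX_VL_BITS + 1, GRANULE_BITS))


if __name__ == "__main__":
    print(legal_vector_lengths())
```

Under this rule there are 16 candidate lengths, and a value such as 192 bits is rejected because it is not a multiple of 128.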

Per-lane predication

SVE provides 16 predicate registers, named p0-p15. Each predicate register is one-eighth of the size of a vector register (1 bit per byte), and is therefore also scalable in size. Predicate registers are written by condition-creating instructions, such as compares. Subsequent instructions can then use a predicate to control which elements (or ‘lanes’) of a vector are operated upon (the active elements).
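To make the idea concrete, the following sketch models a predicate as a list of booleans, one per lane. It is a scalar Python illustration under that assumption, not SVE code; `compare_gt` and `predicated_add` are hypothetical helper names.

```python
# Scalar model of per-lane predication. A predicate is modelled as a list
# of booleans, one per lane; the helper names are hypothetical.

def compare_gt(vec, threshold):
    """Model a condition-creating compare: True marks an active lane."""
    return [x > threshold for x in vec]


def predicated_add(pred, a, b):
    """Add b to a in active lanes only; inactive lanes keep the value of a."""
    return [x + y if p else x for p, x, y in zip(pred, a, b)]


if __name__ == "__main__":
    a = [1, 5, 2, 8]
    pred = compare_gt(a, 3)            # lanes holding 5 and 8 are active
    print(predicated_add(pred, a, [10, 10, 10, 10]))
```

Here the compare produces the predicate and the add consumes it, so only the lanes that passed the comparison are modified.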

Gather-load and scatter-store

Allows data to be efficiently transferred to or from a vector of non-contiguous memory addresses, with or without predication. This enables a significantly wider range of source code constructs to be vectorized. To permit efficient accesses to contiguous memory data, SVE provides a rich set of load and store instructions which progress sequentially forwards through an array, supporting a full range of packed 8, 16, 32 and 64-bit vector element organizations.
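The access pattern can be sketched with a Python list standing in for memory and a list of indices standing in for the vector of addresses. This is an illustrative model only; the function names are hypothetical, and writing zero to inactive lanes of the loaded result is a modelling choice made here.

```python
# Scalar model of gather-load and scatter-store. `memory` is a plain list,
# `indices` plays the role of the vector of addresses, and `pred` is an
# optional list of booleans. All names are hypothetical.

def gather_load(memory, indices, pred=None):
    """Read memory[i] for each active lane; inactive lanes are set to 0."""
    if pred is None:
        pred = [True] * len(indices)
    return [memory[i] if p else 0 for i, p in zip(indices, pred)]


def scatter_store(memory, indices, values, pred=None):
    """Write values[lane] to memory[indices[lane]] for each active lane."""
    if pred is None:
        pred = [True] * len(indices)
    for i, v, p in zip(indices, values, pred):
        if p:
            memory[i] = v


if __name__ == "__main__":
    mem = list(range(100, 110))
    print(gather_load(mem, [7, 0, 3]))
    scatter_store(mem, [9, 1], [-1, -2], pred=[True, False])
    print(mem)
```

A loop such as `out[i] = table[idx[i]]` maps onto `gather_load`, which is the kind of construct that contiguous-only loads cannot vectorize directly.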

Vector partitioning

A vector partition is the dynamically determined portion of a vector defined by a predicate register. SVE permits the progression of a loop one partition at a time, until the whole vector has been processed or the loop has reached its natural conclusion.
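One simple kind of partition is the set of lanes that still fall inside an array on the final loop iteration. The sketch below models that case in Python; `while_lt` and `scaled_copy` are hypothetical names, and `lanes` stands for the number of elements one vector holds on a given implementation.

```python
# Scalar model of a loop that advances one predicate-defined partition at
# a time. Names are hypothetical; `lanes` is the per-vector element count.

def while_lt(start, limit, lanes):
    """Build a predicate that is True while start + lane is below limit."""
    return [start + lane < limit for lane in range(lanes)]


def scaled_copy(src, factor, lanes):
    """Multiply every element of src by factor, `lanes` elements per step."""
    dst = [0] * len(src)
    i = 0
    while i < len(src):
        pred = while_lt(i, len(src), lanes)   # partition for this step
        for lane, active in enumerate(pred):
            if active:                        # inactive lanes are skipped
                dst[i + lane] = src[i + lane] * factor
        i += lanes
    return dst


if __name__ == "__main__":
    data = list(range(10))
    # The result does not depend on the lane count.
    print(scaled_copy(data, 3, 4) == scaled_copy(data, 3, 64))
```

Because the predicate handles the tail, the loop needs no separate scalar epilogue and gives the same result for any lane count.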

Fault-tolerant speculative vectorization

Suppresses memory faults that are not caused by the first active element of the vector, and instead generates a predicate value indicating which of the requested lanes were successfully loaded before the first memory fault. This allows loops with conditional exits or unknown trip counts to be safely vectorized, while maintaining the same faulting behavior as if they had been executed sequentially.
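The behavior can be approximated with a small model in which valid memory is a dictionary and any missing address counts as a fault. This is a simplified sketch, not the architectural definition: the names `MemoryFault` and `first_fault_load` are hypothetical, and the fault-tracking predicate is assumed to start with every bit set.

```python
# Simplified model of a first-fault load. `memory` maps valid addresses to
# values; an address missing from it counts as a fault. Names are
# hypothetical, and the returned `ffr` list is assumed to start all True.

class MemoryFault(Exception):
    """Raised when the first active lane touches an invalid address."""


def first_fault_load(memory, base, pred):
    lanes = len(pred)
    values = [0] * lanes
    ffr = [True] * lanes
    seen_active = False
    for lane, active in enumerate(pred):
        if not active:
            continue
        address = base + lane
        if address not in memory:
            if not seen_active:
                # A fault on the first active element is reported normally.
                raise MemoryFault(address)
            # A later fault is suppressed; this lane and all lanes after
            # it are marked as not loaded.
            for later in range(lane, lanes):
                ffr[later] = False
            break
        values[lane] = memory[address]
        seen_active = True
    return values, ffr


if __name__ == "__main__":
    memory = {100: 7, 101: 8, 102: 9}
    print(first_fault_load(memory, 100, [True] * 4))
```

A loop that scans for a terminator can inspect the returned predicate, process only the lanes that loaded, and retry from the first lane that did not.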

Horizontal and serialized vector operations

SVE has a family of horizontal reduction instructions, which include integer and floating-point summation, minimum, maximum, and bitwise logical reductions. SVE also allows pointer-chasing loops to be performed serially using a vector of addresses and an associated predicate, which lets the remainder of the loop be parallelized.
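A horizontal reduction combines the active elements of one vector into a single scalar. The sketch below models three of the reductions named above in Python; the function names are hypothetical and the predicate is again a list of booleans.

```python
# Scalar model of horizontal reductions over the active lanes of a vector.
# Function names are hypothetical.
from functools import reduce
import operator


def active_elements(pred, vec):
    """Return the elements of vec whose predicate bit is True."""
    return [x for p, x in zip(pred, vec) if p]


def add_reduce(pred, vec):
    """Sum of the active elements (0 if no lane is active)."""
    return sum(active_elements(pred, vec))


def max_reduce(pred, vec):
    """Maximum of the active elements (None if no lane is active)."""
    return max(active_elements(pred, vec), default=None)


def or_reduce(pred, vec):
    """Bitwise OR of the active elements (0 if no lane is active)."""
    return reduce(operator.or_, active_elements(pred, vec), 0)


if __name__ == "__main__":
    pred = [True, False, True, True]
    vec = [3, 100, 5, 4]
    print(add_reduce(pred, vec), max_reduce(pred, vec), or_reduce(pred, vec))
```

The inactive lane holding 100 contributes to none of the results, which is the difference between a predicated reduction and a plain whole-vector one.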

Registers

The instruction set operates on a set of vector and predicate registers, as described in the following table:

| Register | Type | Quantity | Size |
|---|---|---|---|
| z0 to z31 | Data registers | 32 | A minimum of 128 bits, up to an implementation-defined maximum of no more than 2048 bits. Data in these registers can be interpreted as 8-bit bytes, 16-bit halfwords, 32-bit words or 64-bit doublewords. For example, a 384-bit implementation of SVE can hold 48 bytes, 24 halfwords, 12 words, or 6 doublewords of data. |
| p0 to p15 | Predicate registers | 16 | Holds one bit for each byte available in a Z register. For example, an implementation providing 1024-bit Z registers provides 128-bit predicate registers. |
| FFR | Special use predicate register | 1 | Used implicitly by some dedicated instructions, called first faulting loads. |
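The sizes in the table follow from simple division. The following Python sketch reproduces the two worked examples (384-bit data registers and 1024-bit Z registers); the names are hypothetical and exist only for this illustration.

```python
# Arithmetic behind the register sizes in the table above.
# Names are hypothetical.
ELEMENT_BITS = {"byte": 8, "halfword": 16, "word": 32, "doubleword": 64}


def lanes_per_register(vl_bits):
    """Number of elements of each size that fit in one vector register."""
    return {name: vl_bits // bits for name, bits in ELEMENT_BITS.items()}


def predicate_bits(vl_bits):
    """Predicate register size: one bit per byte of the vector register."""
    return vl_bits // 8


if __name__ == "__main__":
    print(lanes_per_register(384))
    print(predicate_bits(1024))
```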

Assembly language

The SVE assembly language is designed to closely mirror the AArch64 Advanced SIMD mnemonics and operand syntax. However, SVE has significant differences from Advanced SIMD which require extensions to the A64 assembly language, as follows:

New register files for vectors and predicates

Adds the register names z0-z31 and p0-p15.

Vector and predicate registers have unknown size

The element count is absent from an SVE vector or predicate shape suffix.

A predicate is a “bit mask”

SVE-capable assemblers will report any inconsistencies between size suffixes and other operands as an error.

Zeroing or merging predication

Predicated instructions either zero the values of inactive lanes (‘zeroing form’) or merge in the prior values of the destination (‘merging form’). Where an instruction supports both forms, its governing predicate has a suffix that indicates which form is being used: /z for zeroing or /m for merging.
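The difference between the two forms only shows up in the inactive lanes. The sketch below models both in Python with a single hypothetical helper, `apply_predicated`, whose `zeroing` flag selects the form.

```python
# Scalar model of zeroing versus merging predication.
# `apply_predicated` is a hypothetical helper, not an SVE intrinsic.
import operator


def apply_predicated(op, pred, dest, src, zeroing):
    """Apply op to src in active lanes.

    Inactive lanes become 0 in the zeroing form, or keep the previous
    value of dest in the merging form.
    """
    return [
        op(s) if p else (0 if zeroing else d)
        for p, d, s in zip(pred, dest, src)
    ]


if __name__ == "__main__":
    pred = [True, False, True, False]
    dest = [9, 9, 9, 9]
    src = [1, 2, 3, 4]
    print(apply_predicated(operator.neg, pred, dest, src, zeroing=False))
    print(apply_predicated(operator.neg, pred, dest, src, zeroing=True))
```

With the merging form the inactive lanes still hold 9 afterwards, while the zeroing form clears them.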

Destructive encodings

Many instructions have destructive two-operand forms where the destination register also contains one of the source operands. To avoid ambiguity, the syntax uses a three-operand constructive notation, with the destructive operand being repeated in both the destination and source positions.

Gather-scatter addressing

The A64 load/store address syntax is extended to allow vector operands within the address specifier.

Predicate / vector condition codes

Adds a new set of aliases for condition codes for use in SVE assembler source and disassembly.

SVE instruction set

The SVE instructions can be separated by function into the following groups. For a more detailed description of the instructions, see the Arm Architecture Reference Manual Supplement - The Scalable Vector Extension (SVE), for Armv8-A.

Load, store, and prefetch instructions
SVE vector load and store instructions transfer data in memory to or from elements of one or more vector or predicate transfer registers. SVE also includes vector prefetch instructions that provide read and write hints to the memory system. Instructions include:

  • Predicated single vector contiguous element accesses
  • Predicated non-contiguous element accesses
  • Predicated multiple vector contiguous structure load/store
  • Predicated replicating element loads
  • Unpredicated vector register load/store
  • Unpredicated predicate register load/store

Vector move operations
Vector move instructions copy data from scalar registers, immediate values, and other vectors to selected vector elements. Instructions include:

  • Element move and broadcast

Integer operations
The integer instructions operate on signed or unsigned integer data within a vector. Instructions include:

  • Integer arithmetic
  • Integer dot product
  • Integer comparisons

Vector address calculation
The vector address calculation instructions compute vectors of addresses and addresses of vectors. This includes instructions to add a multiple of the current vector length or predicate register length, in bytes, to a general-purpose register.
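As a rough illustration of adding a multiple of the vector or predicate length to a scalar, the Python sketch below computes the resulting byte offsets. The function names are hypothetical, and the vector length is passed in explicitly because a Python model has no hardware register to query.

```python
# Scalar model of adding a multiple of the vector length or predicate
# length, in bytes, to a general-purpose register value.
# Names are hypothetical.

def add_vector_lengths(base, multiplier, vl_bits):
    """Return base + multiplier * (vector length in bytes)."""
    return base + multiplier * (vl_bits // 8)


def add_predicate_lengths(base, multiplier, vl_bits):
    """Return base + multiplier * (predicate length in bytes).

    A predicate holds one bit per vector byte, so it is vl_bits / 8 bits
    long, which is vl_bits / 64 bytes.
    """
    return base + multiplier * (vl_bits // 64)


if __name__ == "__main__":
    # Advance a pointer by two vectors on a 256-bit implementation.
    print(hex(add_vector_lengths(0x1000, 2, 256)))
```

The same call gives a different offset on an implementation with a different vector length, which is how code can step through memory or size a stack frame without knowing the length in advance.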

Bitwise operations
The bitwise instructions perform bitwise operations on vectors. Instructions include:

  • Bitwise logical operations
  • Bitwise shift, reverse, and count

Floating-point operations
The floating-point instructions operate on floating-point data within a vector. Instructions include:

  • Floating-point arithmetic
  • Floating-point multiply accumulate
  • Floating-point complex arithmetic
  • Floating-point rounding and conversion
  • Floating-point comparisons
  • Floating-point transcendental acceleration
  • Floating-point indexed multiplies

Predicate operations
The predicate instructions relate to operations that manipulate the predicate registers. Instructions include:

  • Predicate initialization
  • Predicate move operations
  • Predicate logical operations
  • FFR predicate handling
  • Predicate counts
  • Loop control
  • Serialized operations

Move operations
These instructions move data between different vector elements, or between vector elements and scalar registers. Instructions include:

  • Element permute and shuffle
  • Unpacking instructions
  • Predicate permute
  • Index vector generation
  • Move prefix

Reduction operations
Horizontal reduction instructions perform arithmetic horizontally across active elements of a single source vector and deliver a scalar result. Instructions include:

  • Horizontal reductions

Resources

For a more detailed description of SVE, why it is useful for HPC, or to see some example code, see the following additional resources on SVE:

Want to evaluate SVE?

Download Arm Compiler for HPC and Arm Instruction Emulator to compile and run SVE code on non-SVE platforms. Get Code Advisor to see actionable insights and advice for how to optimize your code.