What is the Scalable Vector Extension?

This topic is a brief introduction to Scalable Vector Extension (SVE).

Introduction

Scalable Vector Extension (SVE) is the next-generation SIMD extension of the Armv8-A A64 instruction set. It is not an extension of NEON, but a new set of vector instructions developed to target HPC workloads. SVE enables vectorization of loops that would be either impossible or not beneficial to vectorize with NEON.

Unlike other SIMD architectures, SVE can be Vector Length Agnostic: it does not fix the size of the vector registers, leaving hardware implementors free to choose the size best suited to the intended workloads.

The SVE instruction set introduces the following new architectural features for High Performance Computing (HPC):

Scalable vector length

Each implementation chooses its own vector length, which must be a multiple of 128 bits and must not exceed the architectural maximum of 2048 bits. Vector code that is written without assuming a particular length adapts to the implemented length automatically. SVE provides 32 scalable vector registers, named Z0 – Z31.
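The idea of a loop that adapts to the vector length can be sketched in plain scalar C. This is only a model, not SVE code: the `vl_bits` parameter stands in for the hardware vector length (which real SVE code never hard-codes), and the names `vl_is_legal` and `vla_add_i32` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model only. A legal length is a multiple of 128 bits, up to 2048. */
static int vl_is_legal(unsigned vl_bits) {
    return vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0;
}

/* dst[i] = a[i] + b[i], processed one "vector" of 32-bit lanes at a time.
   Returns the number of vector iterations that were needed. */
static size_t vla_add_i32(unsigned vl_bits, int32_t *dst,
                          const int32_t *a, const int32_t *b, size_t n) {
    assert(vl_is_legal(vl_bits));
    size_t lanes = vl_bits / 32;      /* 32-bit elements per vector */
    size_t iterations = 0;
    for (size_t base = 0; base < n; base += lanes) {
        for (size_t lane = 0; lane < lanes; ++lane) {
            /* A lane is active only while base + lane < n, so the final
               partial vector needs no separate scalar tail loop. */
            if (base + lane < n) {
                dst[base + lane] = a[base + lane] + b[base + lane];
            }
        }
        ++iterations;
    }
    return iterations;
}
```

Calling `vla_add_i32` with different legal values of `vl_bits` produces the same output array; only the iteration count changes.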

Per-lane predication

SVE provides 16 predicate registers, named p0-p15, with each predicate register being 1/8th of the size of the vector register (1 bit per byte), and therefore scalable in size.  Predicate registers are written to using condition-creating instructions, such as compares.  This allows subsequent instructions to control which elements (or ‘lanes’) of a vector should be operated upon (the active elements).

Gather-load and scatter-store

Allows data to be efficiently transferred to or from a vector of non-contiguous memory addresses. This enables a significantly wider range of source code constructs to be vectorized. To permit efficient accesses to contiguous memory data, SVE provides a rich set of load and store instructions which progress sequentially forwards through an array, supporting a full range of packed 8, 16, 32 and 64-bit vector element organizations.
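As a rough scalar illustration of gather-load and scatter-store semantics, the following C sketch reads from and writes to a set of non-contiguous locations selected by an index vector. It is a model under simplifying assumptions (element indices rather than byte addresses, inactive lanes of the load set to zero), and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gather: each active lane loads base[index[lane]]; inactive lanes read 0. */
static void gather_load_i32(int32_t *dst, const uint8_t *pred,
                            const int32_t *base, const uint32_t *index,
                            size_t lanes) {
    assert(dst && pred && base && index);
    for (size_t lane = 0; lane < lanes; ++lane) {
        dst[lane] = pred[lane] ? base[index[lane]] : 0;
    }
}

/* Scatter: each active lane stores src[lane] to base[index[lane]];
   inactive lanes do not touch memory at all. */
static void scatter_store_i32(int32_t *base, const uint8_t *pred,
                              const uint32_t *index, const int32_t *src,
                              size_t lanes) {
    assert(base && pred && index && src);
    for (size_t lane = 0; lane < lanes; ++lane) {
        if (pred[lane]) {
            base[index[lane]] = src[lane];
        }
    }
}
```

This is the access pattern behind source constructs such as `a[i] = table[idx[i]]`, which a contiguous-only vector load cannot express.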

Vector partitioning

A vector partition is the dynamically determined portion of a vector defined by a predicate register. SVE permits the progression of a loop one partition at a time, until the whole vector has been processed or the loop has reached its natural conclusion.
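One way to picture a partition is as the set of lanes that come before the first lane where some exit condition holds. The scalar C sketch below models that for a loop that sums values until it meets a negative one. The fixed `LANES` value and both function names are assumptions made for the illustration, not part of SVE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 8 };   /* fixed lane count, for the model only */

/* The partition is every lane strictly before the first lane whose
   condition is true. Returns the number of lanes in the partition. */
static size_t partition_before_first(uint8_t part[LANES],
                                     const uint8_t cond[LANES]) {
    size_t count = 0;
    int open = 1;
    for (size_t lane = 0; lane < LANES; ++lane) {
        if (cond[lane]) open = 0;
        part[lane] = (uint8_t)open;
        count += (size_t)open;
    }
    return count;
}

/* Sum the values of x up to, but not including, the first negative one. */
static int64_t sum_until_negative(const int32_t *x, size_t n) {
    assert(x != NULL || n == 0);
    int64_t sum = 0;
    for (size_t base = 0; base < n; base += LANES) {
        uint8_t cond[LANES], part[LANES];
        for (size_t lane = 0; lane < LANES; ++lane) {
            size_t idx = base + lane;
            /* Lanes past the end of the array also close the partition. */
            cond[lane] = (uint8_t)(idx >= n || x[idx] < 0);
        }
        size_t active = partition_before_first(part, cond);
        for (size_t lane = 0; lane < LANES; ++lane) {
            if (part[lane]) sum += x[base + lane];
        }
        if (active < LANES) break;   /* the loop reached its conclusion */
    }
    return sum;
}
```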

Fault-tolerant speculative vectorization

Causes memory faults to be suppressed if they do not occur because of the first active element of the vector, and instead generates a predicate value indicating which of the requested lanes were successfully loaded prior to the first memory fault. This allows loops with conditional exits or unknown trip-counts to be safely vectorized, maintaining the same faulting behavior as if they had been executed sequentially. A common use for fault-tolerant speculative vectorization is processing null-terminated C strings, where the length is not known before the loop starts.
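The following scalar C sketch models the behavior described above for a string-length loop. It is an illustration under stated assumptions: `mapped_len` stands in for the boundary of accessible memory, an `assert` stands in for a genuine fault, and the names `load_first_fault` and `strlen_model` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 16 };   /* models a 128-bit vector of bytes */

/* Model of a first-fault byte load from mem[offset...]. Bytes at index
   >= mapped_len are treated as unmapped. Lanes that were loaded get
   ffr[lane] = 1. Only an unmapped FIRST lane is a genuine fault, which
   the assert models. Returns the number of lanes loaded. */
static size_t load_first_fault(uint8_t dst[LANES], uint8_t ffr[LANES],
                               const uint8_t *mem, size_t mapped_len,
                               size_t offset) {
    assert(offset < mapped_len);
    size_t loaded = 0;
    for (size_t lane = 0; lane < LANES; ++lane) {
        int ok = offset + lane < mapped_len;
        ffr[lane] = (uint8_t)ok;
        dst[lane] = ok ? mem[offset + lane] : 0;
        loaded += (size_t)ok;
    }
    return loaded;
}

/* String length computed a vector at a time, without ever faulting on
   bytes beyond the terminator that a sequential loop would not touch. */
static size_t strlen_model(const uint8_t *mem, size_t mapped_len) {
    size_t offset = 0;
    for (;;) {
        uint8_t data[LANES], ffr[LANES];
        size_t loaded = load_first_fault(data, ffr, mem, mapped_len, offset);
        /* Inspect only the lanes that the load reported as valid. */
        for (size_t lane = 0; lane < LANES && ffr[lane]; ++lane) {
            if (data[lane] == 0) return offset + lane;
        }
        offset += loaded;
    }
}
```

If the string is not terminated inside the mapped region, the model eventually asserts on the first lane, which mirrors the fault a sequential loop would take at the same byte.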

Horizontal vector operations

SVE has a family of horizontal reduction instructions which include integer and floating-point summation, minimum, maximum, and bitwise logical reductions.

Serialized vector operations

SVE allows pointer chasing loops to be performed serially using a vector of addresses and an associated predicate, allowing the remainder of the loop to be parallelized.

Registers

The instruction set operates on a new set of vector and predicate registers:

  • 32 Z registers, z0, z1, …, z31;

  • 16 P registers, p0, p1, …, p15;

  • 1 First Faulting Register (FFR).

The Z registers are data registers. The architecture specifies that their size in bits must be a multiple of 128, from a minimum of 128 bits to an implementation-defined maximum of up to 2048 bits. Data in these registers can be interpreted as 8-bit bytes, 16-bit halfwords, 32-bit words or 64-bit doublewords. For example, a 384-bit implementation of SVE can hold 48 bytes, 24 halfwords, 12 words, or 6 doublewords of data. The low 128 bits of each Z register overlap the corresponding NEON register of the Advanced SIMD extension (for example, the low 128 bits of z0 are v0), and therefore also the scalar floating-point registers.

Figure: Register overlapping.

P registers are predicate registers, which are unique to SVE, and hold one bit for each byte available in a Z register. For example, an implementation providing 1024-bit Z registers provides 128-bit predicate registers.

The FFR register is a special predicate register that differs from regular predicate registers by way of being used implicitly by some dedicated instructions, called first faulting loads.

Individual predicate bits encode a boolean true or false, but a predicate lane, which contains between one and eight predicate bits, is either active or inactive, depending on the value of its least significant bit.

Similarly, in this document the terms active or inactive lane will be used to qualify the lanes of data registers under the control of a predicate register.
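The relationship between predicate bits and lanes can be sketched in C as follows. This is a model only: the predicate is stored as a packed byte array with bit 0 corresponding to byte 0 of the vector, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* A predicate register holds one bit per byte of a Z register. */
static unsigned predicate_bits(unsigned vl_bits) {
    return vl_bits / 8;
}

/* A lane of elem_bytes-sized elements owns elem_bytes consecutive
   predicate bits. Only the least significant of those bits decides
   whether the lane is active; the others are ignored. */
static int lane_is_active(const uint8_t *pred, unsigned elem_bytes,
                          unsigned lane) {
    assert(elem_bytes == 1 || elem_bytes == 2 ||
           elem_bytes == 4 || elem_bytes == 8);
    unsigned bit = lane * elem_bytes;
    return (int)((pred[bit / 8] >> (bit % 8)) & 1u);
}
```

For example, with predicate bits 0, 4 and 9 set, the lane of 16-bit elements that owns bits 8 and 9 is inactive, because bit 8 (its least significant bit) is clear.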

Assembly language

The SVE assembly language is designed to closely mirror the AArch64 NEON mnemonics and operand syntax. However, SVE has significant differences which require extensions to the A64 assembly language, as follows:

New register files for vectors and predicates

Adds the register names z0-z31 and p0-p15.

Vector and predicate registers have unknown size

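The element count is absent from an SVE vector or predicate shape suffix. For example, a vector of 32-bit elements is written z0.s, where the NEON equivalent would be v0.4s.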

A predicate is a “bit mask”

SVE-capable assemblers report any inconsistencies between size suffixes and other operands as an error.

Zeroing or merging predication

Predicated instructions either zero the values of inactive lanes, ‘zeroing form’, or merge in the prior values, ‘merging form’. A suffix on the governing predicate operand, /z for zeroing or /m for merging, indicates which form is being used.
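The difference between the two forms can be shown with a small scalar C model. The compare and the predicated add below are hypothetical stand-ins rather than specific SVE instructions, and the fixed `LANES` value is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES = 4 };   /* fixed lane count, for the model only */

/* A compare writes a predicate: lane i is active when a[i] > threshold. */
static void cmp_gt(uint8_t pred[LANES], const int32_t a[LANES],
                   int32_t threshold) {
    assert(pred && a);
    for (size_t i = 0; i < LANES; ++i) {
        pred[i] = (uint8_t)(a[i] > threshold);
    }
}

/* Merging form: inactive lanes keep the destination's prior value. */
static void add_merging(int32_t dst[LANES], const uint8_t pred[LANES],
                        const int32_t a[LANES], const int32_t b[LANES]) {
    for (size_t i = 0; i < LANES; ++i) {
        if (pred[i]) dst[i] = a[i] + b[i];
    }
}

/* Zeroing form: inactive lanes are set to zero. */
static void add_zeroing(int32_t dst[LANES], const uint8_t pred[LANES],
                        const int32_t a[LANES], const int32_t b[LANES]) {
    for (size_t i = 0; i < LANES; ++i) {
        dst[i] = pred[i] ? a[i] + b[i] : 0;
    }
}
```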

Destructive encodings

Many instructions have destructive two-operand forms where the destination register also contains one of the source operands. To avoid ambiguity, the syntax uses a three-operand constructive notation, with the destructive operand being repeated in both the destination and source positions.

Gather-scatter addressing

The A64 load/store address syntax is extended to allow vector operands within the address specifier.

Predicate / vector condition codes

Adds a new set of aliases for condition codes for use in SVE assembler source and disassembly.

SVE instruction set

SVE introduces a variety of instructions that operate on the data and predicate registers. There are two main classes of instructions: predicated instructions, which use a predicate register to control the lanes they operate on, and unpredicated instructions, which do not. In a predicated instruction, only the active lanes of vector operands are processed and can generate side effects, such as memory accesses and faults, or numeric exceptions.

Across these two main classes, there are three broad kinds of instruction: data processing instructions, which operate on Z registers (for example, addition); predicate generation instructions, such as numeric comparisons, which operate on data registers and produce predicate registers; and predicate manipulation instructions, which mostly cover predicate generation or logical operations on predicates.

Note

Only predicate registers p0 through p7 are usable as predicates in data-processing instructions.

Most data manipulation operations cover both floating point (FP) and integer domains, with some notable FP functionality brought by the ordered horizontal reductions, which provide cross-lane operations that preserve the strict C/C++ rules on non-associativity of floating-point operations.
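To see why an ordered reduction matters, the scalar C sketch below compares a strictly ordered sum with a pairwise tree sum. The tree order is used here only as one example of a reassociated sum, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Ordered reduction: start from init and add the active lanes strictly
   from the lowest lane to the highest, as a sequential C loop would. */
static double reduce_ordered(double init, const uint8_t *pred,
                             const double *x, size_t lanes) {
    double acc = init;
    for (size_t lane = 0; lane < lanes; ++lane) {
        if (pred[lane]) acc += x[lane];
    }
    return acc;
}

/* Pairwise tree reduction over a power-of-two number of lanes. This
   reassociates the additions, so the rounding can differ. */
static double reduce_pairwise(const double *x, size_t lanes) {
    assert(lanes >= 1 && lanes <= 32 && (lanes & (lanes - 1)) == 0);
    double tmp[32];
    for (size_t i = 0; i < lanes; ++i) tmp[i] = x[i];
    for (size_t width = lanes; width > 1; width /= 2) {
        for (size_t i = 0; i < width / 2; ++i) {
            tmp[i] = tmp[2 * i] + tmp[2 * i + 1];
        }
    }
    return tmp[0];
}
```

With the input {1e30, 1.0, -1e30, 1.0} and all lanes active, the ordered sum starting from 0.0 is 1.0, while the pairwise sum is 0.0, because each 1.0 is absorbed when it is added to a value of magnitude 1e30.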

A significant proportion of the new instruction set is dedicated to vector load/store instructions, which can perform signed or unsigned extension or truncation of the data, and come with a wide range of new addressing modes that improve the efficiency of SVE code.

SVE instructions can be separated by function into the following groups. For a more detailed description of the instructions, see the ARM Architecture Reference Manual Supplement - The Scalable Vector Extension (SVE), for ARMv8-A document.

Load, store, and prefetch instructions

SVE vector load and store instructions transfer data in memory to or from elements of one or more vector or predicate registers. SVE also includes vector prefetch instructions that provide read and write hints to the memory system. Instructions include:

  • Predicated single vector contiguous element accesses

  • Predicated non-contiguous element accesses

  • Predicated multiple vector contiguous structure load/store

  • Predicated replicating element loads

  • Unpredicated vector register load/store

  • Unpredicated predicate register load/store

Note

Unpredicated vector register load and store instructions do not perform endianness conversion, and should not be used in your code.

Vector move operations

Vector move instructions copy data from scalar registers, immediate values, and other vectors to selected vector elements. Instructions include:

  • Element move and broadcast

Integer operations

The integer instructions operate on signed or unsigned integer data within a vector. Instructions include:

  • Integer arithmetic

  • Integer dot product

  • Integer comparisons

Vector address calculation

The vector address calculation instructions compute vectors of addresses and addresses of vectors. This includes instructions to add a multiple of the current vector length or predicate register length, in bytes, to a general-purpose register.
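The arithmetic behind adding a multiple of the vector or predicate register length to a scalar register can be sketched as follows. The functions are hypothetical models loosely based on the ADDVL and ADDPL instructions. The `vl_bits` parameter stands in for the hardware vector length, and the immediate range of -32 to 31 is assumed from the instruction encodings.

```c
#include <assert.h>
#include <stdint.h>

/* base + imm * (vector register length in bytes) */
static uint64_t addvl_model(uint64_t base, int imm, unsigned vl_bits) {
    assert(imm >= -32 && imm <= 31);
    assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);
    int64_t vl_bytes = (int64_t)(vl_bits / 8);
    return base + (uint64_t)((int64_t)imm * vl_bytes);
}

/* base + imm * (predicate register length in bytes). A predicate has
   one bit per vector byte, so it is 1/8th of the vector length. */
static uint64_t addpl_model(uint64_t base, int imm, unsigned vl_bits) {
    assert(imm >= -32 && imm <= 31);
    assert(vl_bits >= 128 && vl_bits <= 2048 && vl_bits % 128 == 0);
    int64_t pl_bytes = (int64_t)(vl_bits / 64);
    return base + (uint64_t)((int64_t)imm * pl_bytes);
}
```

Because the scaling is by the implemented length, code that steps through memory or stack space in whole-vector units does not need to know that length when it is written.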

Bitwise operations

The bitwise instructions perform bitwise operations on vectors. Instructions include:

  • Bitwise shift, reverse, and count

Floating-point operations

The floating-point instructions operate on floating-point data within a vector. Instructions include:

  • Floating-point arithmetic

  • Floating-point multiply accumulate

  • Floating-point complex arithmetic

  • Floating-point rounding and conversion

  • Floating-point comparisons

  • Floating-point transcendental acceleration

  • Floating-point indexed multiplies

Predicate operations

The predicate instructions relate to operations that manipulate the predicate registers. Instructions include:

  • Predicate initialization

  • Predicate move operations

  • Predicate logical operations

  • FFR predicate handling

  • Predicate counts

  • Loop control

  • Serialized operations

Move operations

These instructions move data between different vector elements, or between vector elements and scalar registers. Instructions include:

  • Element permute and shuffle

  • Unpacking instructions

  • Predicate permute

  • Index vector generation

  • Move prefix

Reduction operations

Horizontal reduction instructions perform arithmetic horizontally across active elements of a single source vector and deliver a scalar result. Instructions include:

  • Horizontal reductions

Arm C Language Extensions (ACLE) for Arm SVE

The aim of the Arm C Language Extensions (ACLE) is to make features of the Arm architecture directly available in C and C++ programs. The core ACLE is defined in a dedicated document, while the ACLE for Arm SVE document defines the part that is specific to the Arm Scalable Vector Extension (SVE).

General information on Arm C Language Extensions is available on the ACLE Developer web page.
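As a brief illustration of what ACLE for SVE code can look like, the following sketch implements `y[i] += a * x[i]` with SVE intrinsics from `arm_sve.h` when the compiler defines `__ARM_FEATURE_SVE`, and falls back to a plain scalar loop otherwise. The function name `daxpy` and the fallback structure are choices made for this example, not something the ACLE prescribes.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

/* y[i] += a * x[i] for 0 <= i < n. */
static void daxpy(int64_t n, double a, const double *x, double *y) {
    assert(n >= 0);
#if defined(__ARM_FEATURE_SVE)
    int64_t i = 0;
    /* Lanes are active while i + lane < n. */
    svbool_t pg = svwhilelt_b64(i, n);
    while (svptest_any(svptrue_b64(), pg)) {
        svfloat64_t xv = svld1(pg, &x[i]);
        svfloat64_t yv = svld1(pg, &y[i]);
        /* yv + xv * a on the active lanes */
        svst1(pg, &y[i], svmla_x(pg, yv, xv, a));
        i += (int64_t)svcntd();   /* 64-bit elements per vector */
        pg = svwhilelt_b64(i, n);
    }
#else
    /* Scalar fallback for targets without SVE. */
    for (int64_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
#endif
}
```

The SVE path contains no vector length constant: `svcntd()` reports the number of 64-bit elements at run time, and the predicate from `svwhilelt_b64` handles the final partial vector.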
