Load and store - data structures

Neon structure load instructions read data from memory into Neon registers (64-bit or 128-bit, depending on the arrangement specifier), with optional deinterleaving.

Structure store instructions work similarly, reinterleaving data from registers before writing it to memory, as shown in the following diagram:

Syntax

The structure load and store instructions follow a consistent syntax.

The following diagram shows the general syntax of the structure load and store instructions:

This instruction syntax has the following format:

  • An instruction mnemonic, with two parts:
    • The operation, either LD for loads or ST for stores.
    • A numeric interleave pattern specifying the number of elements in each structure.
  • A set of Neon registers to be read or written. A maximum of four registers can be listed, depending on the interleave pattern. Each entry in the set of Neon registers has two parts:
    • The Neon register name, for example V0.
    • An arrangement specifier. This indicates the number of bits in each element and the number of elements that can fit in the Neon vector register. For example, 16B indicates that each element is one byte (B), and each vector is a 128-bit vector containing 16 elements.
  • A general-purpose register containing the location to access in memory. The address can be updated after the access.

Interleave pattern

Neon provides instructions to load and store interleaved structures containing from one to four equally sized elements. Elements are the standard Neon-supported widths of 8 (B), 16 (H), 32 (S), or 64 (D) bits.

  • LD1 is the simplest form. It loads one to four registers of data from memory, with no deinterleaving. You can use LD1 to process an array of non-interleaved data.
  • LD2 loads two registers of data, deinterleaving even and odd elements into those registers. You can use LD2 to separate stereo audio data into left and right channels.
  • LD3 loads three registers and deinterleaves. You can use LD3 to split RGB pixel data into separate color channels.
  • LD4 loads four registers and deinterleaves. You can use LD4 to process ARGB image data.

The store instructions ST1, ST2, ST3, and ST4 support the same options, but interleave the data from registers before writing them to memory.
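As a rough illustration of the interleave patterns above, the following portable C sketch models what LD3 and ST3 do with 16 byte-sized RGB pixels. It is a scalar model for reasoning about data layout, not Neon code, and the function names ld3_16b_model and st3_16b_model are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of: LD3 {V0.16B, V1.16B, V2.16B}, [X0]
 * Reads 48 bytes laid out as R,G,B,R,G,B,... and splits them
 * into three 16-byte arrays, one per color channel. */
void ld3_16b_model(const uint8_t *mem,
                   uint8_t v0[16], uint8_t v1[16], uint8_t v2[16])
{
    assert(mem != NULL);
    for (size_t i = 0; i < 16; i++) {
        v0[i] = mem[3 * i + 0];  /* first element of structure i  */
        v1[i] = mem[3 * i + 1];  /* second element of structure i */
        v2[i] = mem[3 * i + 2];  /* third element of structure i  */
    }
}

/* Scalar model of: ST3 {V0.16B, V1.16B, V2.16B}, [X0]
 * Reinterleaves the three arrays back into 48 bytes of memory. */
void st3_16b_model(uint8_t *mem,
                   const uint8_t v0[16], const uint8_t v1[16],
                   const uint8_t v2[16])
{
    assert(mem != NULL);
    for (size_t i = 0; i < 16; i++) {
        mem[3 * i + 0] = v0[i];
        mem[3 * i + 1] = v1[i];
        mem[3 * i + 2] = v2[i];
    }
}
```

Loading with the first function and storing with the second reproduces the original 48 bytes, which is the round trip that a matching LD3 and ST3 pair performs.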

Element types

Loads and stores interleave elements based on the size that is specified to the instruction.

For example, consider the following instruction:

LD2 {V0.8H, V1.8H}, [X0]

This instruction loads two Neon registers with deinterleaved data starting from the memory address in X0. The 8H in the arrangement specifier indicates that each element is a 16-bit halfword (H), and each Neon register is loaded with eight elements. This instruction therefore results in eight 16-bit elements in the first register V0, and eight 16-bit elements in the second register V1. Adjacent pairs (even and odd) are separated to each register, as shown in the following diagram:

The following instruction uses the arrangement specifier 4S, changing the element size to 32 bits:

LD2 {V0.4S, V1.4S}, [X0]

Changing the element size to 32 bits loads the same amount of data, but now only four elements make up each vector, as shown in the following diagram:
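To make the effect of the element size concrete, the following C sketch models LD2 with 128-bit registers at the byte level. It is a scalar approximation rather than Neon code: the function name ld2_model and its elem_bytes parameter are hypothetical, and each register is represented as 16 bytes in memory order.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of LD2 {V0.T, V1.T}, [X0] with 128-bit registers.
 * elem_bytes = 2 models the 8H arrangement, 4 models 4S.
 * Both cases read the same 32 bytes; only the grouping changes. */
void ld2_model(const uint8_t *mem, size_t elem_bytes,
               uint8_t v0[16], uint8_t v1[16])
{
    assert(elem_bytes == 1 || elem_bytes == 2 ||
           elem_bytes == 4 || elem_bytes == 8);
    size_t lanes = 16 / elem_bytes;  /* elements per register */
    for (size_t i = 0; i < lanes; i++) {
        /* Even-numbered elements go to V0, odd-numbered to V1. */
        memcpy(v0 + i * elem_bytes, mem + (2 * i) * elem_bytes, elem_bytes);
        memcpy(v1 + i * elem_bytes, mem + (2 * i + 1) * elem_bytes, elem_bytes);
    }
}
```

With memory bytes numbered 0 to 31, an element size of 2 places bytes 0, 1, 4, 5, and so on in the first register, whereas an element size of 4 places bytes 0 to 3, 8 to 11, and so on in the first register.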

Element size also affects endianness handling. In general, if you specify the correct element size to the load and store instructions, bytes are read from memory in the appropriate order. This means that the same code works on little-endian systems and big-endian systems.

Finally, element size has an impact on pointer alignment. Alignment to the element size generally gives better performance, and it might be a requirement of your target operating system. For example, when loading 32-bit elements, align the address of the first element to at least 32 bits (4 bytes).
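One way to check this alignment guideline before issuing a load is sketched below in C. The helper name is_element_aligned is hypothetical, and the check assumes that the element size in bytes is a power of two, which holds for the 1, 2, 4, and 8 byte element sizes described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if p is aligned to elem_bytes.
 * elem_bytes must be a nonzero power of two (1, 2, 4, or 8 here). */
bool is_element_aligned(const void *p, size_t elem_bytes)
{
    assert(elem_bytes != 0 && (elem_bytes & (elem_bytes - 1)) == 0);
    return ((uintptr_t)p & (elem_bytes - 1)) == 0;
}
```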

Single or multiple elements

In addition to loading multiple elements, structure loads can also read single elements from memory with deinterleaving. Data can either be replicated to all lanes of a Neon register, or inserted into a single lane, leaving the other lanes intact.

For example, the following instruction loads a single three-element data structure from the memory address pointed to by X0, then replicates that data into all lanes of three Neon registers:

LD3R { V0.16B, V1.16B, V2.16B }, [X0]

The following diagram shows the operation of this instruction:

By contrast, the following instruction loads a single three-element data structure into a single lane of three Neon registers, leaving the other lanes intact:

LD3 { V0.B, V1.B, V2.B }[4], [X0]

The following diagram shows the operation of this instruction. This form of the load instruction is useful when you need to construct a vector from data scattered in memory.

Stores are similar, providing support for writing single or multiple elements with interleaving.
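The difference between the replicating and single-lane forms can be sketched in portable C as follows. This is a scalar model only, and the function names ld3r_16b_model and ld3_lane_b_model are hypothetical. Both functions read the same three bytes from memory and differ only in which lanes they write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of: LD3R {V0.16B, V1.16B, V2.16B}, [X0]
 * Loads one three-element structure and replicates each
 * element to all 16 lanes of its register. */
void ld3r_16b_model(const uint8_t *mem,
                    uint8_t v0[16], uint8_t v1[16], uint8_t v2[16])
{
    assert(mem != NULL);
    for (size_t i = 0; i < 16; i++) {
        v0[i] = mem[0];
        v1[i] = mem[1];
        v2[i] = mem[2];
    }
}

/* Scalar model of: LD3 {V0.B, V1.B, V2.B}[lane], [X0]
 * Loads one three-element structure into a single lane of each
 * register and leaves every other lane unchanged. */
void ld3_lane_b_model(const uint8_t *mem, size_t lane,
                      uint8_t v0[16], uint8_t v1[16], uint8_t v2[16])
{
    assert(mem != NULL && lane < 16);
    v0[lane] = mem[0];
    v1[lane] = mem[1];
    v2[lane] = mem[2];
}
```

Calling the single-lane model repeatedly with different lane numbers and source addresses builds up a vector from scattered data one structure at a time.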

Addressing

Structure load and store instructions support three formats for specifying addresses:

  • Register (no offset): [Xn]

    This is the simplest form. Data is loaded from or stored to the address that is specified by Xn.

  • Register with post-index, immediate offset: [Xn], #imm

    Use this form to update the pointer in Xn after loading or storing, ready to load or store the next elements.

    The immediate increment value #imm must be equal to the number of bytes that is read or written by the instruction.

    For example, the following instruction loads 48 bytes of data, using three registers, each containing 16 x 1-byte data elements. This means that the immediate increment is 48:

    LD3 { V0.16B, V1.16B, V2.16B }, [X0], #48

    However, the next example loads 32 bytes of data, using two registers, each containing 2 x 8-byte data elements. This means that the immediate increment is 32:

    LD2 { V0.2D, V1.2D }, [X0], #32

  • Register with post-index, register offset: [Xn], Xm

    After the memory access, increment the pointer by the value in register Xm. This form is useful when reading or writing groups of elements that are separated by fixed widths, for example when reading a vertical line of data from an image.
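The two post-index forms above can be modeled in portable C as a rough sketch. The function names are hypothetical, and the code is a scalar model rather than Neon code. The first function returns the updated pointer, as the immediate form does with X0. The second walks down an image column, using the row stride as the register offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of: LD3 {V0.16B, V1.16B, V2.16B}, [X0], #48
 * Deinterleaves 48 bytes, then returns the post-incremented
 * address (X0 + 48), ready for the next load. */
const uint8_t *ld3_16b_post_imm_model(const uint8_t *x0,
                                      uint8_t v0[16], uint8_t v1[16],
                                      uint8_t v2[16])
{
    assert(x0 != NULL);
    for (size_t i = 0; i < 16; i++) {
        v0[i] = x0[3 * i + 0];
        v1[i] = x0[3 * i + 1];
        v2[i] = x0[3 * i + 2];
    }
    return x0 + 48;
}

/* Scalar model of 16 single-lane loads of the form
 *   LD1 {V0.B}[lane], [X0], X1
 * where x1 holds the row stride in bytes. Each step loads one
 * byte into the next lane, then advances the address by x1. */
void load_column_model(const uint8_t *x0, size_t x1, uint8_t v0[16])
{
    assert(x0 != NULL);
    size_t offset = 0;  /* tracks how far X0 has advanced */
    for (size_t lane = 0; lane < 16; lane++) {
        v0[lane] = x0[offset];  /* the memory access      */
        offset += x1;           /* the post-index update  */
    }
}
```

For an image stored row by row with a stride of 8 bytes, passing the address of column 3 in the first row collects the bytes at offsets 3, 11, 19, and so on into consecutive lanes.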

Other types of loads and stores

This guide only deals with structure loads and stores. However, Neon also provides other types of load and store instruction, including:

  • LDR and STR to load and store single Neon registers.
  • LDP and STP to load or store pairs of Neon registers.

For more details on supported load and store operations, see the Arm Architecture Reference Manual.

Detailed cycle timing information for the instructions can be found in the Technical Reference Manual for each core.
