Load and store - example RGB conversion
This section considers an example task of converting RGB data to BGR color data.
In a 24-bit RGB image, the pixels are arranged in memory as R, G, B, R, G, B, and so on. You want to perform a simple image-processing operation, like switching the red and blue channels. How can you do this efficiently using Neon?
Using a load that pulls RGB data items sequentially from memory into registers makes swapping the red and blue channels awkward.
Consider the following instruction, which loads RGB data one byte at a time from memory into consecutive lanes of three Neon registers:
LD1 { V0.16B, V1.16B, V2.16B }, [x0]
The following diagram shows the operation of this instruction:
Code to swap channels based on this input would be complicated. We would need to mask different lanes to obtain the different color components, then shift those components and recombine. The resulting code is unlikely to be efficient.
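To see why the sequential load is awkward, it can help to model it in plain C. The sketch below is only an illustration, not Neon code: the `ld1_regs_t` type and the `model_ld1x3` and `ld1_lane_channel` names are hypothetical, and each 16-byte array stands in for one Neon register.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for V0, V1, V2: three 16-byte registers. */
typedef struct { uint8_t v[3][16]; } ld1_regs_t;

/* Scalar model of LD1 { V0.16B, V1.16B, V2.16B }, [x0]:
   48 consecutive bytes are copied into the registers in memory order. */
static ld1_regs_t model_ld1x3(const uint8_t *src)
{
    ld1_regs_t r;
    memcpy(r.v, src, 48);  /* V0 = bytes 0..15, V1 = 16..31, V2 = 32..47 */
    return r;
}

/* Channel found in a given lane after the load: 0 = R, 1 = G, 2 = B. */
static int ld1_lane_channel(int reg, int lane)
{
    return (16 * reg + lane) % 3;
}
```

Under this model, lane 0 holds a red byte in V0, a green byte in V1, and a blue byte in V2, so no single mask or shift isolates a channel across all three registers.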
Neon provides structure load and store instructions to help in these situations. These instructions pull in data from memory and simultaneously separate the loaded values into different registers.
For this example, you can use the LD3 instruction to separate the red, green, and blue data values into different Neon registers as they are loaded:
LD3 { V0.16B, V1.16B, V2.16B }, [x0]
The following diagram shows how the above instruction separates the different data channels:
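The de-interleaving behavior can be sketched as a scalar C model. This is an illustrative assumption about how to picture the instruction, not how the hardware implements it; the `ld3_regs_t` type and `model_ld3` function are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical stand-in for V0, V1, V2: val[0] = V0, val[1] = V1, val[2] = V2. */
typedef struct { uint8_t val[3][16]; } ld3_regs_t;

/* Scalar model of LD3 { V0.16B, V1.16B, V2.16B }, [x0]:
   element c of pixel i goes to lane i of register c. */
static ld3_regs_t model_ld3(const uint8_t *src)
{
    ld3_regs_t r;
    for (int i = 0; i < 16; i++) {          /* 16 pixels per load */
        for (int c = 0; c < 3; c++) {       /* 0 = R, 1 = G, 2 = B */
            r.val[c][i] = src[3 * i + c];
        }
    }
    return r;
}
```

In this model, `val[0]` ends up holding the 16 red bytes, `val[1]` the 16 green bytes, and `val[2]` the 16 blue bytes, which is the layout that makes a whole-register swap possible.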
The red and blue values can now be switched easily using the MOV instruction to copy the entire vector. Finally, we write the data back to memory, with reinterleaving, using the ST3 store instruction.
A single iteration of this RGB to BGR switch can be coded as follows:
LD3 { V0.16B, V1.16B, V2.16B }, [x0], #48  // 3-way interleaved load from
                                           // address in X0, post-incremented
                                           // by 48
MOV V3.16B, V0.16B                         // Swap V0 -> V3
MOV V0.16B, V2.16B                         // Swap V2 -> V0
MOV V2.16B, V3.16B                         // Swap V3 -> V2
                                           // (net effect is to swap V0 and V2)
ST3 { V0.16B, V1.16B, V2.16B }, [x1], #48  // 3-way interleaved store to address
                                           // in X1, post-incremented by 48
Each iteration of this code does the following:
- Loads from memory 16 red bytes into V0, 16 green bytes into V1, and 16 blue bytes into V2.
- Increments the source pointer in X0 by 48 bytes ready for the next iteration. The increment of 48 bytes is the total number of bytes that we read into all three registers, so 3 x 16 bytes in total.
- Swaps the vector of red values in V0 with the vector of blue values in V2, using V3 as an intermediary.
- Stores the data in V0, V1, and V2 to memory, starting at the address that is specified by the destination pointer in X1, and increments the pointer.
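The steps above could be wrapped in a loop over a whole image. The following C sketch is one possible way to do that, using the `vld3q_u8` and `vst3q_u8` intrinsics from `arm_neon.h` in place of hand-written LD3 and ST3. The `rgb_to_bgr` name, the pixel-count parameter, and the scalar fallback for leftover pixels and non-AArch64 builds are assumptions added for illustration. Both buffers are assumed to hold `3 * n_pixels` bytes.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: swap the R and B channels of n_pixels packed
   24-bit pixels, reading from src and writing to dst. */
static void rgb_to_bgr(const uint8_t *src, uint8_t *dst, size_t n_pixels)
{
    size_t i = 0;
#if defined(__aarch64__)
    /* 16 pixels (48 bytes) per iteration, mirroring LD3 / MOV / ST3. */
    for (; i + 16 <= n_pixels; i += 16) {
        uint8x16x3_t px = vld3q_u8(src + 3 * i);  /* de-interleave R, G, B */
        uint8x16_t tmp = px.val[0];               /* keep red, like V3     */
        px.val[0] = px.val[2];                    /* blue into first slot  */
        px.val[2] = tmp;                          /* red into third slot   */
        vst3q_u8(dst + 3 * i, px);                /* re-interleave B, G, R */
    }
#endif
    /* Scalar path: leftover pixels, or all pixels on other targets. */
    for (; i < n_pixels; i++) {
        uint8_t r = src[3 * i];
        uint8_t g = src[3 * i + 1];
        uint8_t b = src[3 * i + 2];
        dst[3 * i]     = b;
        dst[3 * i + 1] = g;
        dst[3 * i + 2] = r;
    }
}
```

The pointer arithmetic on `src + 3 * i` and `dst + 3 * i` plays the role of the 48-byte post-increment in the assembly version. A compiler is free to choose different registers or to eliminate the temporary, so the generated code may not match the listing instruction for instruction.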