Load and store - example RGB conversion

This section considers an example task of converting RGB data to BGR color data.

In a 24-bit RGB image, the pixels are arranged in memory as R, G, B, R, G, B, and so on. You want to perform a simple image-processing operation, like switching the red and blue channels. How can you do this efficiently using Neon?

Using a load that pulls RGB data items sequentially from memory into registers makes swapping the red and blue channels awkward.

Consider the following instruction, which loads RGB data one byte at a time from memory into consecutive lanes of three Neon registers:

LD1 { V0.16B, V1.16B, V2.16B }, [x0]

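The following diagram shows the operation of this instruction. The 48 bytes are copied in memory order, so V0 receives R0, G0, B0, R1, and so on up to R5, V1 continues from G5, and V2 continues from B10. Every register therefore holds a mixture of all three channels: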

Code to swap channels based on this input would be complicated. You would need to mask different lanes to obtain the different color components, then shift those components and recombine them. The resulting code is unlikely to be efficient.
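To make the lane mixing concrete, here is a minimal scalar sketch in portable C of what the contiguous load does. It is only a model for reasoning about the layout, not Neon code. The names `ld1_model` and `ld1_lane_channel` are hypothetical, and the array `v` stands in for registers V0 to V2.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical scalar model of LD1 {V0.16B, V1.16B, V2.16B}, [x0].
   The 48 source bytes are copied unchanged, 16 per register.
   v[0], v[1], and v[2] stand in for V0, V1, and V2. */
static void ld1_model(const uint8_t *src, uint8_t v[3][16])
{
    assert(src != NULL);
    memcpy(v, src, 48);
}

/* Channel that ends up in lane `lane` of register `reg` when the source
   is packed RGB: 0 = red, 1 = green, 2 = blue. */
static int ld1_lane_channel(int reg, int lane)
{
    return (16 * reg + lane) % 3;
}
```

In this model, lane 0 of V0 holds a red byte, lane 0 of V1 a green byte, and lane 0 of V2 a blue byte, so no single register contains one channel only.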

Neon provides structure load and store instructions to help in these situations. These instructions pull in data from memory and simultaneously separate the loaded values into different registers. For this example, you can use the LD3 instruction to separate the red, green, and blue data values into different Neon registers as they are loaded:

LD3 { V0.16B, V1.16B, V2.16B }, [x0]

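The following diagram shows how this instruction separates the different data channels. After the load, V0 holds the 16 red bytes R0 to R15, V1 holds the green bytes G0 to G15, and V2 holds the blue bytes B0 to B15: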
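The de-interleaving behavior of LD3 can be sketched as a scalar C model. This is an illustration under the assumption of packed 8-bit RGB input, not a replacement for the instruction. The name `ld3_model` is hypothetical, and the array `v` stands in for registers V0 to V2.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of LD3 {V0.16B, V1.16B, V2.16B}, [x0].
   Byte 3*lane + reg of the source goes to lane `lane` of register `reg`,
   so v[0] collects red, v[1] green, and v[2] blue for packed RGB data. */
static void ld3_model(const uint8_t *src, uint8_t v[3][16])
{
    assert(src != NULL);
    for (int lane = 0; lane < 16; lane++) {
        for (int reg = 0; reg < 3; reg++) {
            v[reg][lane] = src[3 * lane + reg];
        }
    }
}
```

Each row of `v` now contains a single channel, which is what makes the later channel swap a whole-register copy.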

The red and blue values can now be switched easily by using the MOV instruction to copy entire vectors. Finally, you write the data back to memory, re-interleaving it, using the ST3 store instruction.

A single iteration of this RGB to BGR switch can be coded as follows:

LD3  { V0.16B, V1.16B, V2.16B }, [x0], #48  // 3-way interleaved load from
                                            // address in X0, post-incremented
                                            // by 48
MOV    V3.16B, V0.16B                       // Copy V0 (red) to V3
MOV    V0.16B, V2.16B                       // Copy V2 (blue) to V0
MOV    V2.16B, V3.16B                       // Copy V3 (saved red) to V2
                                            // (net effect is to swap V0 and V2)
ST3  { V0.16B, V1.16B, V2.16B }, [x1], #48  // 3-way interleaved store to address
                                            // in X1, post-incremented by 48

Each iteration of this code does the following:

  • Loads from memory 16 red bytes into V0, 16 green bytes into V1, and 16 blue bytes into V2.
  • Increments the source pointer in X0 by 48 bytes, ready for the next iteration. The increment of 48 bytes is the total number of bytes read into all three registers: 3 × 16 = 48 bytes.
  • Swaps the vector of red values in V0 with the vector of blue values in V2, using V3 as an intermediary.
  • Stores the data in V0, V1, and V2 to memory with re-interleaving, starting at the address that is specified by the destination pointer in X1, and increments that pointer by 48 bytes.
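The steps above can be sketched as a scalar reference model in portable C, which may be useful for checking the output of the assembly version. It is a sketch under stated assumptions: the function name `rgb_to_bgr_model` and the constants are hypothetical, and the pixel count is assumed to be a multiple of 16, because the original code shows a single iteration and does not cover loop control or leftover pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 16, BLOCK_BYTES = 3 * LANES };  /* 48 bytes per iteration */

/* Hypothetical scalar model of the LD3 / MOV / ST3 sequence, repeated
   over a whole buffer. Assumes `pixels` is a multiple of 16. */
static void rgb_to_bgr_model(const uint8_t *src, uint8_t *dst, size_t pixels)
{
    assert(pixels % LANES == 0);

    for (size_t done = 0; done < pixels; done += LANES) {
        uint8_t v0[LANES], v1[LANES], v2[LANES], v3[LANES];

        /* LD3: de-interleave 16 pixels into red, green, and blue vectors. */
        for (int i = 0; i < LANES; i++) {
            v0[i] = src[3 * i];
            v1[i] = src[3 * i + 1];
            v2[i] = src[3 * i + 2];
        }
        src += BLOCK_BYTES;               /* post-increment by 48 */

        /* Three MOVs: swap V0 and V2 using V3 as an intermediary. */
        memcpy(v3, v0, LANES);
        memcpy(v0, v2, LANES);
        memcpy(v2, v3, LANES);

        /* ST3: re-interleave the three vectors on the way back to memory. */
        for (int i = 0; i < LANES; i++) {
            dst[3 * i]     = v0[i];
            dst[3 * i + 1] = v1[i];
            dst[3 * i + 2] = v2[i];
        }
        dst += BLOCK_BYTES;               /* post-increment by 48 */
    }
}
```

The model keeps the same four-register structure as the assembly so that each step can be compared directly. A production routine would also need to handle images whose pixel count is not a multiple of 16.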