Data processing methodologies

When processing large sets of data, a major limiting factor on performance is the CPU time taken to execute data processing instructions. This time depends on the number of instructions needed to work through the entire data set, which in turn depends on how many items of data each instruction can process.

Single Instruction Single Data (SISD)

Most Arm instructions are Single Instruction Single Data (SISD). Each instruction performs its specified operation on a single data source. Processing multiple data items therefore requires multiple instructions. For example, to perform four addition operations requires four instructions to add values from four pairs of registers:

ADD w0, w0, w5
ADD w1, w1, w6
ADD w2, w2, w7
ADD w3, w3, w8

This method is relatively slow, and it can be difficult to see how the different registers are related. To improve performance and efficiency, media processing is often offloaded to dedicated processors such as a Graphics Processing Unit (GPU) or Media Processing Unit, which can process more than one data value with a single instruction.

If the values you are dealing with are smaller than the register width, that extra potential bandwidth is wasted with SISD instructions. For example, when adding 8-bit values together, each 8-bit value needs to be loaded into a separate 64-bit register. Performing large numbers of individual operations on small data sizes does not use machine resources efficiently, because the processor, registers, and data path are all designed for 64-bit calculations.

Single Instruction Multiple Data (SIMD)

Single Instruction Multiple Data (SIMD) instructions perform the same operation simultaneously for multiple data items. These data items are packed as separate lanes in a larger register.

For example, the following instruction adds four pairs of 32-bit integer values. In this case, however, the values are packed into separate lanes in two 128-bit source registers. Each lane in the first source register is added to the corresponding lane in the second source register, and the result is stored in the same lane in the destination register:

ADD V10.4S, V8.4S, V9.4S
// This operation adds two 128-bit (quadword) registers, V8 and V9,
// and stores the result in V10.
// Each of the four 32-bit lanes in each register is added separately.
// There are no carries between the lanes.

This single instruction operates on all data values in the large register at the same time.

Performing the four operations with a single SIMD instruction is faster than performing them with four separate SISD instructions.

The diagram shows 128-bit registers each holding four 32-bit values, but other combinations are possible for Neon registers:

  • Two 64-bit, four 32-bit, eight 16-bit, or sixteen 8-bit integer data elements can be operated on simultaneously using all 128 bits of a Neon register.
  • Two 32-bit, four 16-bit, or eight 8-bit integer data elements can be operated on simultaneously using the lower 64 bits of a Neon register (in this case, the upper 64 bits of the Neon register are unused).

Note that the addition operations shown in the diagram are truly independent for each lane. Any overflow or carry from lane 0 does not affect lane 1, which is an entirely separate calculation.

Media processors, such as those used in mobile devices, often split each full data register into multiple sub-registers and perform computations on the sub-registers in parallel. If the processing for a data set is simple and repeated many times, SIMD can give considerable performance improvements. It is particularly beneficial for digital signal processing and multimedia algorithms, such as:

  • Audio, video, and image processing codecs.
  • 2D graphics based on rectangular blocks of pixels.
  • 3D graphics.
  • Color-space conversion.
  • Physics simulations.