Arm Helium technology is the M-Profile Vector Extension (MVE) for the Arm Cortex-M processor series. Helium is an extension of the Armv8.1-M architecture and delivers a significant performance increase for Machine Learning (ML) and Digital Signal Processing (DSP) applications.
Helium lets you optimize performance by using Single Instruction Multiple Data (SIMD) to perform the same operation simultaneously on multiple data items. The Helium instructions are designed for Digital Signal Processing (DSP) and Machine Learning (ML) applications in particular.
Data processing methodologies
A common technique for accelerating workloads that process many data elements is to exploit the parallelism that is often available in the data, by operating on several elements with one instruction. This data processing methodology is called Single Instruction Multiple Data (SIMD). Processing elements sequentially, one at a time, is called Single Instruction Single Data (SISD). This section of the guide discusses both. Operating on multiple data elements at the same time reduces the total time that is required to process all data elements. A reduced compute time means that results are available at a higher throughput and lower latency, and that the CPU can enter a low-power state earlier to save energy. In Armv8.1-M, the SIMD data processing methodology is supported by Arm Helium technology.
Single Instruction Single Data
Most Arm instructions are Single Instruction Single Data (SISD). Each instruction performs its specified operation on a single data item. Processing multiple data items therefore requires multiple instructions.
The following example shows how to perform four addition operations. You can see that this requires four instructions to add values from four pairs of registers:
    ADD r0, r0, r5
    ADD r1, r1, r6
    ADD r2, r2, r7
    ADD r3, r3, r8
This method is slow, and it can be difficult to see how different registers are related.
If the values that you are dealing with are smaller than the maximum register size, the unused register width is wasted with SISD instructions. For example, when adding 8-bit values together, each 8-bit value must be loaded into a separate 32-bit register, so only a quarter of each register carries useful data. Performing many individual operations on small data items does not use machine resources efficiently.
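The cost described above can be sketched in portable C. This is only an illustrative model, not Helium code: the function names `add_u8_sisd` and `register_utilization_percent` are hypothetical, and the `uint32_t` locals merely stand in for 32-bit core registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar (SISD) model, names are hypothetical: each 8-bit value is widened
 * into its own 32-bit "register", so one add handles one element.
 * Returns the number of add operations that were needed. */
size_t add_u8_sisd(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    assert(n == 0 || (dst != NULL && a != NULL && b != NULL));
    size_t ops = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t ra = a[i];           /* 8 useful bits in a 32-bit register */
        uint32_t rb = b[i];
        dst[i] = (uint8_t)(ra + rb);  /* result truncated back to 8 bits */
        ops++;
    }
    return ops;
}

/* Percentage of a register's bits that carry data for one element. */
unsigned register_utilization_percent(unsigned element_bits, unsigned register_bits)
{
    assert(register_bits != 0 && element_bits <= register_bits);
    return (100u * element_bits) / register_bits;
}
```

In this model, adding four pairs of 8-bit values takes four add operations, and each operation uses 8 of the 32 available register bits.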
Single Instruction Multiple Data
Single Instruction Multiple Data (SIMD) instructions perform the same operation for multiple data items. These data items are packed as separate lanes in a larger register.
For example, the following instruction adds four pairs of 32-bit values together. However, in this case, the values are packed into separate lanes in two 128-bit source registers, which are q registers. Each lane in the first source register is added to the corresponding lane in the second source register, and the result is stored in the same lane in the destination register. You can see this in the following code:
    VADD.I32 q2, q1, q0   // Adds the 32-bit (word) lanes of q1 to the
                          // corresponding lanes of q0 and stores the results in q2.
                          // Each of the four 32-bit lanes is added separately.
                          // There are no carries between the lanes.
This single instruction operates on all data values in the large register, as you can see here:
Being able to specify parallel operations in a single instruction like this allows the processor to do the calculations simultaneously, which increases performance and throughput. The preceding diagram shows 128-bit registers, with each register holding four 32-bit values. Operations on data elements with different data sizes are possible for Helium registers. We explain this in more detail in Helium registers.
The addition operations that are shown in the diagram are independent for each lane. For example, any overflow or carry from lane 0 does not affect lane 1. Lane 1 is an entirely separate calculation.
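The lane independence described above can be modeled in portable C. The following is a minimal sketch, not Helium code: `vadd_i32_model` is a hypothetical name, and plain arrays of four `uint32_t` values stand in for the 128-bit q registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* four 32-bit lanes model one 128-bit register */

/* Hypothetical scalar model of a lane-wise add in the style of
 * VADD.I32 q2, q1, q0. Each lane is computed on its own, so a
 * wrap-around in one lane never reaches its neighbor. */
void vadd_i32_model(uint32_t dst[LANES],
                    const uint32_t a[LANES],
                    const uint32_t b[LANES])
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (int lane = 0; lane < LANES; lane++) {
        dst[lane] = a[lane] + b[lane];  /* wraps modulo 2^32 within the lane */
    }
}
```

If lane 0 holds 0xFFFFFFFF and 1 is added to it, lane 0 wraps to 0 and the other lanes are unaffected. The loop here runs the lanes one after another; real vector hardware is free to compute them in parallel.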
Media processors, like those in audio and video devices, often split each full data register into multiple sub-registers and perform computations on the sub-registers in parallel. If the processing for the data sets is simple and repeated many times, SIMD can give considerable performance improvements. It is beneficial for digital signal processing and multimedia algorithms, for example:
- Audio, video, and image-processing codecs
- 2D graphics based on rectangular blocks of pixels
- 3D graphics
- Color-space conversion
- Physics simulations
- Machine Learning
Helium and Neon comparison
One of the main differences between Helium and Neon is that Helium is the vector extension for the Armv8.1-M architecture, whereas Neon is the vector extension for A-profile architectures such as Armv7-A.
The similarities between Helium and Neon are:
- Both use 128-bit vectors.
- Both reuse floating point registers.
- Both provide instructions to perform vector add and vector multiply.
The differences between Helium and Neon are:
- Helium includes eight vector registers, and Neon includes 16 vector registers.
- Helium instructions can use both general-purpose registers and vector registers. This allows Helium to reduce the pressure on its vector registers.
- Helium supports some newer data types, for example, fp16, which was introduced for the A-profile in Armv8.2-A. Armv7-A Neon does not support arithmetic on these data types.
- Helium includes features like loop predication and lane predication that Neon does not include.
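The idea behind lane predication in the last item can be sketched in portable C. This is a simplified model under stated assumptions, not Helium code: `add_predicated_model` is a hypothetical name, the vector is assumed to hold four 32-bit lanes, and the predicate is modeled as a per-lane flag that disables lanes beyond the end of the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* assumed: four 32-bit lanes per vector */

/* Hypothetical model of a predicated vector loop. The arrays are
 * processed LANES elements per iteration. In the final iteration,
 * lanes that fall past element n-1 are inactive and write nothing,
 * so no separate scalar clean-up loop is needed.
 * Returns the number of vector iterations. */
size_t add_predicated_model(uint32_t *dst, const uint32_t *a,
                            const uint32_t *b, size_t n)
{
    assert(n == 0 || (dst != NULL && a != NULL && b != NULL));
    size_t iterations = 0;
    for (size_t base = 0; base < n; base += LANES) {
        size_t remaining = n - base;
        for (size_t lane = 0; lane < LANES; lane++) {
            int active = lane < remaining;  /* predicate bit for this lane */
            if (active) {
                dst[base + lane] = a[base + lane] + b[base + lane];
            }
        }
        iterations++;
    }
    return iterations;
}
```

With six elements, this model runs two iterations: the first with all four lanes active, and the second with two active lanes and two inactive lanes whose destination memory is left untouched.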
Two of the main applications for Helium are Digital Signal Processing (DSP) and Machine Learning (ML). Helium offers a significant performance increase in these areas.
Helium provides intrinsics targeted for DSP instructions, for example:
- vld2q, which loads blocks of two-way interleaved data from memory, deinterleaves them, and writes them to the two destination registers.
- vrmlsldavhaq, which is useful for multiplying complex numbers, an operation that is common in DSP.
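The deinterleaving effect of a two-way structure load such as vld2q can be modeled in portable C. This sketch does not call the real intrinsic, which is declared in arm_mve.h and works on vector types. The name `deinterleave2_model` is hypothetical, and four 32-bit lanes per register are assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* assumed: four 32-bit lanes per destination register */

/* Hypothetical scalar model of a two-way deinterleaving load.
 * src holds 2 * LANES values stored as pairs: x0, y0, x1, y1, ...
 * first receives every even-indexed element (x0, x1, ...),
 * second receives every odd-indexed element (y0, y1, ...). */
void deinterleave2_model(const int32_t src[2 * LANES],
                         int32_t first[LANES],
                         int32_t second[LANES])
{
    assert(src != NULL && first != NULL && second != NULL);
    for (size_t lane = 0; lane < LANES; lane++) {
        first[lane]  = src[2 * lane];
        second[lane] = src[2 * lane + 1];
    }
}
```

This layout is convenient for data such as complex samples stored as real and imaginary pairs, because all real parts end up in one register and all imaginary parts in another.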
ML is a subset of Artificial Intelligence (AI) that gives systems the ability to automatically learn and improve from experience without being explicitly programmed. Helium helps to accelerate matrix multiplication operations, which are the foundation of Convolutional Neural Networks and classical Machine Learning kernels.
Applications that can be greatly accelerated by Helium include Fast Fourier Transform (FFT) and Complex Dot Product, because there are specific instructions that help to implement these calculations.
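For reference, the Complex Dot Product calculation can be written as a scalar loop in portable C. This is a plain baseline that a vectorized version would need to match, not Helium code. The names `complex_acc` and `complex_dot_model` are hypothetical, and the sketch assumes 32-bit integer inputs stored as interleaved real and imaginary pairs, 64-bit accumulators, and a non-conjugated product.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 64-bit accumulator pair for the complex result. */
typedef struct {
    int64_t re;
    int64_t im;
} complex_acc;

/* Scalar reference for a (non-conjugated) complex dot product.
 * a and b each hold n complex numbers stored as re, im, re, im, ...
 * For each pair: (ar + j*ai) * (br + j*bi)
 *              = (ar*br - ai*bi) + j*(ar*bi + ai*br) */
complex_acc complex_dot_model(const int32_t *a, const int32_t *b, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL));
    complex_acc acc = {0, 0};
    for (size_t i = 0; i < n; i++) {
        int64_t ar = a[2 * i], ai = a[2 * i + 1];
        int64_t br = b[2 * i], bi = b[2 * i + 1];
        acc.re += ar * br - ai * bi;  /* real part: product difference */
        acc.im += ar * bi + ai * br;  /* imaginary part: product sum */
    }
    return acc;
}
```

Each element contributes four multiplications, one subtraction, and one addition before accumulation, which is the kind of repeated multiply-accumulate pattern that vector instructions are well suited to.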