Many ARM CPU and Mali GPU processors include vector or Single Instruction Multiple Data (SIMD) instructions. These enable the processor to perform multiple operations with a single instruction.
A vector instruction applies the same operation to every element of a vector at once. The number and type of operations that you can perform depend on the vector extension in your processor.
For example, an ARM processor with the NEON Media Processing Engine can perform up to four 32-bit operations, eight 16-bit operations, or sixteen 8-bit operations simultaneously, depending on the implementation.
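The operation counts above follow from dividing the vector register width by the element width. The following is a minimal sketch of that arithmetic, assuming 128-bit registers such as the NEON quadword registers. The names `VECTOR_REGISTER_BITS` and `lanes_for_element_bits` are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: 128-bit vector registers, as with NEON quadword (Q) registers.
   Other vector extensions can use a different width. */
enum { VECTOR_REGISTER_BITS = 128 };

/* Returns how many elements of the given width fit in one vector register.
   This is the number of operations one vector instruction can perform. */
static inline size_t lanes_for_element_bits(size_t element_bits)
{
    assert(element_bits != 0); /* guard against division by zero */
    return VECTOR_REGISTER_BITS / element_bits;
}
```

Under this assumption, 32-bit elements give four lanes, 16-bit elements give eight, and 8-bit elements give sixteen.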
Vector instructions can significantly increase performance for operations that apply the same computation to many data elements. Use vector processing where possible. It increases performance and reduces code size, which makes the code more cacheable.
You can sometimes use vector instructions in a loop as a form of loop unrolling. This can reduce the total number of iterations the loop must execute by a factor of four or more.
If the number of data items being processed is not a multiple of the number of elements in the vector, you might require additional code to process the elements at the end, and possibly at the start, of the data. This code is only executed once.
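The following sketch illustrates a vectorized loop with separate code for the leftover end elements. It is one possible arrangement, not a definitive implementation. The function name `add_arrays` is hypothetical. When the compiler defines `__ARM_NEON`, the main loop uses the NEON intrinsics `vld1q_f32`, `vaddq_f32`, and `vst1q_f32` to process four 32-bit floats per iteration. Otherwise, a scalar fallback with the same four-element structure is used so that the example builds on other targets. Code for start elements is omitted because `vld1q_f32` and `vst1q_f32` do not require the data to be aligned to the vector size.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical example: dst[i] = a[i] + b[i] for i in [0, n). */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    assert(n == 0 || (dst != NULL && a != NULL && b != NULL));

    size_t i = 0;

#if defined(__ARM_NEON)
    /* Main loop: one vector add handles four 32-bit floats,
       so the loop runs about n / 4 times instead of n times. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#else
    /* Portable fallback with the same four-element step (no NEON available). */
    for (; i + 4 <= n; i += 4) {
        dst[i]     = a[i]     + b[i];
        dst[i + 1] = a[i + 1] + b[i + 1];
        dst[i + 2] = a[i + 2] + b[i + 2];
        dst[i + 3] = a[i + 3] + b[i + 3];
    }
#endif

    /* End elements: at most three items remain when n is not a multiple
       of four. This code runs once, after the main loop. */
    for (; i < n; ++i) {
        dst[i] = a[i] + b[i];
    }
}
```

For example, with seven elements the main loop executes once for the first four elements, and the final loop processes the remaining three.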