Factors affecting NEON vectorization performance
The automatic vectorization process and the performance of the generated code are affected by a number of factors:
- The way loops are organized
For best performance, the innermost loop in a loop nest must access arrays with a stride of one.
- The way the data is structured
The data type dictates how many data elements can be held in a NEON register, and therefore how many operations can be performed in parallel.
- The iteration counts of loops
Longer iteration counts are generally better, because the loop overhead is amortized over more iterations. Tiny iteration counts, such as two or three elements, can be faster to process with non-vector instructions.
- The data type of arrays
For example, NEON does not improve performance when double-precision floating-point arrays are used.
- The use of memory hierarchy
On most current processors, memory bandwidth is low relative to processing capacity. For example, a loop that performs relatively few arithmetic operations on large data sets retrieved from main memory is limited by the memory bandwidth of the system.
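
The following C fragment is a minimal sketch of two of the factors above: loop organization and data type. It is an illustration only, not compiler output. The names NEON_Q_LANES, scale_rows, scale_cols, ROWS and COLS are hypothetical, and the lane counts assume a 128-bit (quadword) NEON register. Whether a given compiler vectorizes either loop depends on the compiler and its options.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: a quadword NEON register is 128 bits (16 bytes) wide, so the
   number of elements processed in parallel is 16 / sizeof(element type). */
#define NEON_Q_BYTES 16
#define NEON_Q_LANES(type) (NEON_Q_BYTES / sizeof(type))

enum {
    LANES_U8  = NEON_Q_LANES(uint8_t),  /* 16 elements per register */
    LANES_I16 = NEON_Q_LANES(int16_t),  /*  8 elements per register */
    LANES_F32 = NEON_Q_LANES(float)     /*  4 elements per register */
};

#define ROWS 64
#define COLS 64

/* Stride of one: the innermost loop walks consecutive elements of a row,
   which is the access pattern recommended for the innermost loop. */
void scale_rows(float a[ROWS][COLS], float k)
{
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            a[i][j] *= k;
}

/* Same result, but the innermost loop steps through memory with a stride
   of COLS elements, so consecutive iterations touch non-adjacent data. */
void scale_cols(float a[ROWS][COLS], float k)
{
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            a[i][j] *= k;
}
```

Both functions compute the same values. Only scale_rows gives the innermost loop a stride of one, so it is the form more likely to vectorize well.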