Permutation - rearranging vectors
When writing programs for SIMD architectures like Neon, performance is often directly related to data ordering. The ordering of data in memory might be inappropriate or suboptimal for the operation that you want to perform.
One solution to these issues might be to rearrange the entire data set in memory before data processing begins. However, this approach is likely to have a high cost to performance. This solution might not even be possible, if your input is a continuous stream of data.
A better solution might be to reorder data values as they are processed. Reordering operations is called permutation. Neon provides a range of permute instructions that typically do the following:
- Take input data from one or more source registers
- Rearrange the data
- Write the result of the permutation to a destination register
Permutations can help to optimize data processing, but you must remember the following guidelines:
- Permuting data is only useful if it leads to an overall increase in performance for your application. Do you really need to permute your data?
- Permute instructions always have a time cost because they only prepare data. Permute instructions do not process data.
- Different instructions might use different hardware pipelines. An optimal solution maximizes the use of idle pipelines.
When rearranging data, you have the following goals:
- Minimize the number of permute instructions used.
- Choose instructions that are likely to use idle pipelines when they are executed.
Alternatives to permutation
How can you avoid wasting unnecessary processor cycles on data permutation? Here are some options to consider:
- Change the input data structure.
If the input data is well-ordered to begin with, there is no need to rearrange data during loading. However, consider the effects of data locality on cache performance before changing your data structures.
Changing the structure of input data is often not possible, for example when you do not have control over the format.
- Redesign your algorithm.
- Another algorithm might be available that better suits the input data.
- Modify previous processing stages.
- It might be possible to rearrange data more efficiently earlier in the program, especially if the application has a long or complex data pipeline.
- Use interleaving loads and stores.
- Some Neon load and store instructions can interleave and deinterleave data. These interleaving instructions are often used with explicit data permutations, which reduces the total number of instructions required.
You can use any of these approaches, or a combination, to optimize code for Neon.