Automatic vectorization involves the high-level analysis of loops in your code. This is the most efficient way to map the majority of typical code onto the functionality of the NEON unit.
For most code, the gains that can be made with algorithm-dependent parallelism on a smaller scale are very small relative to the cost of automatic analysis of such opportunities. For this reason, the NEON unit is designed as a target for loop-based parallelism.
Vectorization is carried out in a way that ensures that optimized code gives the same results as nonvectorized code. In certain cases, to avoid the possibility of an incorrect result, vectorization of a loop is not carried out. This can lead to suboptimal code, and you might have to manually tune your code to make it more suitable for automatic vectorization.
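As a hedged illustration of the kind of manual tuning described above, consider possible pointer overlap, a common reason a compiler cannot prove that a vectorized loop gives the same result. The sketch below uses hypothetical function names (add_may_alias, add_no_alias) and the standard C99 restrict qualifier. Whether a given compiler actually vectorizes either version depends on the compiler and its options; some compilers vectorize the first form anyway by adding a run-time overlap check.

```c
#include <stddef.h>

/* The compiler must assume that dst may overlap a or b, so a store to
 * dst[i] could change a value that a later iteration reads. That
 * possibility can prevent or complicate vectorization. */
void add_may_alias(float *dst, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}

/* restrict is a promise from the programmer that the arrays do not
 * overlap. The loop body is unchanged, but the compiler no longer has
 * to guard against iterations interfering with each other. */
void add_no_alias(float *restrict dst, const float *restrict a,
                  const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}
```

Both functions produce the same results when the arrays do not overlap. Calling add_no_alias with overlapping arrays breaks the promise made by restrict and is undefined behavior.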
Automatic vectorization is also often impeded by earlier manual optimization attempts, such as manual loop unrolling in the source code or complex array accesses. For optimal results, write code using simple loops so that the compiler can perform the optimization. For hand-optimized legacy code, it can be easier to rewrite critical portions of the code based on the original algorithm using simple loops.
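The contrast between hand-unrolled code and a simple loop can be sketched as follows. The function names (scale_unrolled, scale_simple) are hypothetical and chosen only for illustration. The two functions compute the same values, but the second states the algorithm directly and leaves unrolling and vectorization decisions to the compiler.

```c
#include <stddef.h>

/* Hand-unrolled by four, with a separate loop for the leftover
 * elements. The extra structure can make the loop harder for the
 * compiler to analyze. */
void scale_unrolled(float *restrict out, const float *restrict in,
                    float k, size_t n)
{
    size_t i = 0;
    for (; i + 3 < n; i += 4) {
        out[i]     = in[i]     * k;
        out[i + 1] = in[i + 1] * k;
        out[i + 2] = in[i + 2] * k;
        out[i + 3] = in[i + 3] * k;
    }
    for (; i < n; i++) {
        out[i] = in[i] * k;
    }
}

/* The same algorithm written as one simple loop: a single induction
 * variable, unit-stride accesses, and no manual unrolling. */
void scale_simple(float *restrict out, const float *restrict in,
                  float k, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = in[i] * k;
    }
}
```

Rewriting legacy code from the first form to the second does not guarantee vectorization, but it removes one obstacle named above.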
Coding in vectorizable loops that target the NEON extensions, instead of writing explicit NEON instructions, preserves code portability between processors. Performance levels similar to those of hand-coded vectorization are achieved with less effort.