Check your knowledge
-
What is Neon?
Neon is the implementation of the Advanced SIMD extension to the Arm architecture. All processors compliant with the Armv8-A architecture (for example, the Cortex-A76 or Cortex-A57) include Neon. From the programmer's view, Neon provides 32 additional 128-bit registers, with instructions that operate on 8-, 16-, 32-, or 64-bit lanes within these registers.
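The lane layout described above can be modelled in portable C. This is only an illustrative sketch: the type q_reg_model and the functions lanes_per_register and add_4s are hypothetical names invented here, not part of any Arm header, and the code does not touch real Neon registers. The member names b, h, s, and d follow the arrangement letters used in A64 assembly (16B, 8H, 4S, 2D).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of one 128-bit register viewed as lanes.
 * Every member covers the same 16 bytes. */
typedef union {
    uint8_t  b[16];  /* sixteen 8-bit lanes  */
    uint16_t h[8];   /* eight 16-bit lanes   */
    uint32_t s[4];   /* four 32-bit lanes    */
    uint64_t d[2];   /* two 64-bit lanes     */
} q_reg_model;

/* Number of lanes of the given width that fit in 128 bits. */
int lanes_per_register(int lane_bits)
{
    assert(lane_bits == 8 || lane_bits == 16 ||
           lane_bits == 32 || lane_bits == 64);
    return 128 / lane_bits;
}

/* Lane-wise add on the four 32-bit lanes: each lane is added
 * independently, with no carry between lanes. */
q_reg_model add_4s(q_reg_model x, q_reg_model y)
{
    q_reg_model r;
    for (int i = 0; i < 4; i++) {
        r.s[i] = x.s[i] + y.s[i];  /* unsigned wrap-around per lane */
    }
    return r;
}
```

A real Neon instruction performs the four additions in add_4s at once; the loop here only spells out what "operating on lanes" means.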
-
How do you enable Neon code generation with Arm Compiler?
Target AArch64 with --target=aarch64-arm-none-eabi and specify a suitable optimization level, such as -O1 -fvectorize, or -O2 and higher.
-
Suppose the Arm compiler automatically unrolls a loop to a depth of two. How would you force the compiler to unroll to a depth of four?
#pragma clang loop interleave_count(4) will achieve this, applying only to that particular loop.
-
How can you best write source code to assist the compiler optimizations?
Consider the following function when compiled with the -O1 compiler option:
float vec_dot(float *vec_A, float *vec_B, int len_vec) {
    float ret = 0;
    int i;
    for (i=0; i<len_vec; i++) {
        ret += vec_A[i]*vec_B[i];
    }
    return ret;
}
You could make the following changes to assist the compiler optimizations:
- Compile at -O2 or higher, or with -fvectorize.
- Specify #pragma clang loop vectorize(enable) before the loop as a hint to the compiler.
- Note that the function only reads from the vectors, so adding the restrict keyword has no effect here; it does not matter whether the input arrays overlap.
- SLP vectorization comes with increased code size in this case. This may be acceptable depending on hardware limits and expected input array length.
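To make the restrict point concrete, here is a minimal sketch of a case where the keyword would matter. The function vec_mul_store is hypothetical and not part of the original exercise. Unlike vec_dot, it writes to an output array, so without restrict the compiler has to allow for out overlapping vec_A or vec_B.

```c
#include <assert.h>

/* Hypothetical helper, not from the original text: element-wise product
 * stored to a third array. The restrict qualifiers promise the compiler
 * that out, vec_A, and vec_B do not overlap. */
void vec_mul_store(float *restrict out, const float *restrict vec_A,
                   const float *restrict vec_B, int len_vec)
{
    assert(len_vec >= 0);
    for (int i = 0; i < len_vec; i++) {
        /* Without restrict, this store could alias a later input element. */
        out[i] = vec_A[i] * vec_B[i];
    }
}
```

vec_dot has no such store: it only accumulates into the local variable ret, which is why restrict changes nothing there.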
Here is the optimized source code:
float vec_dot(float *vec_A, float *vec_B, int len_vec) {
    float ret = 0;
    int i;
    #pragma clang loop vectorize(enable)
    for (i=0; i<len_vec; i++) {
        ret += vec_A[i]*vec_B[i];
    }
    return ret;
}
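For comparison, the following sketch shows roughly what a vectorized form of this loop looks like when written by hand with Neon intrinsics. It is not the compiler's actual output. The name vec_dot_sketch is an assumption made for illustration, and the Neon path is only compiled when targeting AArch64; on other targets the function falls back to the plain scalar loop. Because the vector path sums four partial accumulators, the result can differ from the scalar version in the low-order bits for general floating-point inputs.

```c
#include <assert.h>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical hand-vectorized variant of vec_dot (illustration only). */
float vec_dot_sketch(const float *vec_A, const float *vec_B, int len_vec)
{
    assert(len_vec >= 0);
    float ret = 0.0f;
    int i = 0;
#if defined(__aarch64__) && defined(__ARM_NEON)
    /* Four 32-bit float lanes, each holding one partial sum. */
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= len_vec; i += 4) {
        float32x4_t a = vld1q_f32(vec_A + i);  /* load 4 floats */
        float32x4_t b = vld1q_f32(vec_B + i);
        acc = vmlaq_f32(acc, a, b);            /* acc += a * b, per lane */
    }
    ret = vaddvq_f32(acc);                     /* add the 4 lanes together */
#endif
    /* Scalar loop: handles the leftover elements, or everything
     * when the Neon path is not compiled in. */
    for (; i < len_vec; i++) {
        ret += vec_A[i] * vec_B[i];
    }
    return ret;
}
```

The trailing scalar loop is needed because len_vec is not guaranteed to be a multiple of four.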