
Vectorization diagnostics to tune code for improved performance

The compiler can provide diagnostic information that indicates where it applied vectorization successfully and where it could not. The command-line options that provide this information are --diag_warning=optimizations and --remarks.

Example 16 shows two functions that together add a constant to each element of an array. As written, this code does not vectorize because the loop body contains a function call.

int addition(int a, int b)
{
    return a + b;
}
void add_int(int *pa, int *pb, unsigned int n, int x)
{
    unsigned int i;
    for(i = 0; i < n; i++) *(pa + i) = addition(*(pb + i),x);
    /* Function calls cannot be vectorized */
}

Using the --diag_warning=optimizations option produces an optimization warning message for the addition() function. The --remarks option produces a similar message.

Adding the __inline qualifier to the definition of addition() enables this code to vectorize, but the result is still not optimal. Using the --diag_warning=optimizations option again produces optimization warning messages indicating that the loop vectorizes but that there is a potential pointer aliasing problem. The --remarks option produces similar messages.

Because the compiler cannot prove that pa and pb point to non-overlapping memory, it must generate a runtime test for aliasing and output both vectorized and scalar copies of the code. Example 17 shows how this can be improved using the restrict keyword (written __restrict in the example) if you know that the pointers are not aliased.

__inline int addition(int a, int b)
{
    return a + b;
}
void add_int(int * __restrict pa, int * __restrict pb, unsigned int n, int x)
{
    unsigned int i;
    for(i = 0; i < n; i++) *(pa + i) = addition(*(pb + i),x);
}
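
The runtime aliasing test described before Example 17 can be pictured with the following portable C sketch. It is an illustration only, not actual compiler output: the names ranges_overlap and add_int_versioned are hypothetical, and the overlap check assumes a flat address space in which pointers convert to comparable integers. Both branches contain the same source loop. The point is that without __restrict the compiler has to emit two copies of it and choose between them at run time.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if the two n-element int ranges overlap.
   Assumes a flat address space (pointer-to-integer conversion is
   implementation-defined in C). */
static int ranges_overlap(const int *a, const int *b, unsigned int n)
{
    uintptr_t ua = (uintptr_t)a;
    uintptr_t ub = (uintptr_t)b;
    uintptr_t bytes = (uintptr_t)n * sizeof(int);
    return (ua < ub + bytes) && (ub < ua + bytes);
}

/* Hypothetical illustration of the two code paths a compiler might emit. */
void add_int_versioned(int *pa, int *pb, unsigned int n, int x)
{
    unsigned int i;
    if (!ranges_overlap(pa, pb, n)) {
        /* No overlap: this copy of the loop is safe to vectorize. */
        for (i = 0; i < n; i++) pa[i] = pb[i] + x;
    } else {
        /* Possible overlap: this copy must keep strict scalar order. */
        for (i = 0; i < n; i++) pa[i] = pb[i] + x;
    }
}
```

Qualifying both pointers with __restrict, as in Example 17, removes the need for the test and for the scalar copy.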

The final improvement that can be made is to the number of loop iterations. In Example 17, the number of iterations is not fixed and might not be a multiple of the number of elements that fit in a NEON register (four 32-bit int values in a 128-bit register). This means that the compiler must test for remaining iterations and execute them using non-vectorized code. If you know that your iteration count is always such a multiple, you can indicate this to the compiler. Example 18 shows the final improvement that can be made to obtain the best performance from vectorization.

__inline int addition(int a, int b)
{
    return a + b;
}
void add_int(int * __restrict pa, int * __restrict pb, unsigned int n, int x)
{
    unsigned int i;
    /* Tell the compiler that n is a multiple of 4 */
    __promise((n % 4) == 0);
    /* Masking with ~3 also rounds the trip count down to a multiple of 4 */
    for(i = 0; i < (n & ~3); i++) *(pa + i) = addition(*(pb + i),x);
}
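
To see why the iteration count matters, the following portable C sketch writes out by hand the loop structure that is needed when n is not known to be a multiple of four. It is a minimal illustration under stated assumptions, not compiler output: the name add_int_blocked is hypothetical, and the four scalar statements in the main loop stand in for a single NEON operation on four 32-bit values. The __restrict spelling follows the examples above and is also accepted as an extension by several other compilers.

```c
/* Hypothetical hand-written equivalent of a vectorized loop with cleanup. */
void add_int_blocked(int * __restrict pa, const int * __restrict pb,
                     unsigned int n, int x)
{
    unsigned int i;
    unsigned int main_count = n & ~3u;  /* largest multiple of 4 <= n */

    /* Main loop: four elements per pass, standing in for one vector add. */
    for (i = 0; i < main_count; i += 4) {
        pa[i]     = pb[i]     + x;
        pa[i + 1] = pb[i + 1] + x;
        pa[i + 2] = pb[i + 2] + x;
        pa[i + 3] = pb[i + 3] + x;
    }

    /* Cleanup loop: the 0 to 3 leftover elements, handled one at a time. */
    for (; i < n; i++) pa[i] = pb[i] + x;
}
```

When the compiler is told that n is a multiple of four, as in Example 18, the cleanup loop and the test that guards it are no longer needed.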