To optimize your application:

  1. Start by compiling your application with the -mcpu=native and -O3 compiler options, and consider using the -Ofast option. For C code, also try the -fsimdmath option, which provides vectorized implementations of common libm calls.


    • For Fortran source, the vectorized implementations are used by default when possible, but can be disabled using the -fno-simdmath compiler flag.

    • The -Ofast option enables all the optimizations from -O3, but also performs other aggressive optimizations that might violate strict compliance with language standards.

    • If your Fortran application runs into issues with -Ofast, try -Ofast -fno-stack-arrays, which forces automatic arrays onto the heap.

    • If -Ofast is not acceptable because it produces wrong results from the reordering of math operations, use -O3 -ffp-contract=fast instead.

    • If -ffp-contract=fast does not produce the correct results, then use -O3.

    • For a full list of compiler options, see the Arm C/C++ Compiler reference guide and Arm Fortran Compiler reference guide.

  2. Use the optimized Arm Performance Libraries with the -armpl compiler option.

    Arm Performance Libraries provide optimized standard core math libraries for high-performance computing applications on Arm processors:

    • BLAS - Basic Linear Algebra Subprograms (including XBLAS, the extended precision BLAS).

    • LAPACK - a comprehensive package of higher level linear algebra routines.

    • FFT - a set of Fast Fourier Transform routines for real and complex data. Arm Performance Libraries support FFTW's Basic, Advanced, Guru, and MPI interfaces.

    The -armpl compiler option makes these libraries significantly easier to use by providing a simple interface to select thread-parallelism and architectural tuning. Arm Performance Libraries also provide improved Fortran math intrinsics with auto-vectorization.

    The -armpl and -mcpu options enable the compiler to find appropriate Arm Performance Libraries header files (during compilation) and libraries (during linking). Both options are required for the best results.


    • If your build process compiles and links as two separate steps, add the same -armpl and -mcpu options to both. For more information about using the -armpl option, see Getting Started with Arm Performance Libraries on the Arm Developer website.

    • For GCC, load the correct environment module for your system and explicitly link to your chosen flavor (lp64/ilp64, mp) using the full library path.

    For more information, refer to the Arm Performance Libraries Developer web page.

Get help with optimization

Optimization remarks are a feature of LLVM compilers that provides information about the choices the compiler makes about inlining, vectorization, and more.

In armflang, optimization remarks are enabled by passing one or more of the following -Rpass flags at the command line:

Optimization remarks

• -Rpass=<regexp> - To request information about what Arm Compiler has optimized.

• -Rpass-analysis=<regexp> - To request information about what Arm Compiler has analyzed.

• -Rpass-missed=<regexp> - To request information about what Arm Compiler failed to optimize.

In each case, <regexp> selects the type of remarks to provide. For example, use loop-vectorize for information on vectorization, and inline for information on inlining. The -Rpass flags accept regular expressions, so (loop-vectorize|inline) can be used to capture any remark on vectorization or inlining.

Optimization remarks are piped to stdout at compile time. For more information, see Using Optimization Remarks with Arm Fortran Compiler or Using Optimization Remarks with Arm C/C++ Compiler.


Optimization remarks require that an appropriate debug flag is set, such as -g.

  1. Use the Arm Compiler directives.

    Arm Fortran Compiler supports general-purpose and OpenMP-specific directives:

    • !DIR$ IVDEP - A generic directive that tells the compiler to ignore any potential memory dependencies between loop iterations and vectorize the loop.

    • !$OMP SIMD - An OpenMP directive to indicate a loop can be transformed into a SIMD loop.

    • !DIR$ VECTOR ALWAYS - Forces the compiler to vectorize a loop irrespective of any potential performance implications. The loop must be vectorizable.

    • !DIR$ NOVECTOR - Disables vectorization of a loop.

    • !DIR$ UNROLL - Instructs the compiler to unroll the loop it precedes.

    • !DIR$ NOUNROLL - Instructs the compiler not to unroll the loop it precedes.

    For more information, see the directives section of the Arm Fortran Compiler reference guide.

  2. Optimize by iteration. Use Arm Forge Professional to iteratively debug and profile your ported application.

    Arm Forge is composed of the Arm DDT debugger and the Arm MAP profiler:

    • Use Arm DDT to debug your code to ensure application correctness. It supports both interactive and non-interactive debugging, and can optionally be integrated into your CI workflows.

    • Use Arm MAP to profile your code to measure your application performance. MAP collects a broad set of performance metrics, time classification metrics, specific metrics (for example, MPI call and message rates, I/O data rates, and energy data), and instruction information (hardware counters), to give a comprehensive understanding of the performance of your code. MAP also supports custom metrics, so you can develop your own set of metrics of interest.

    Use the Arm Forge tools and follow an iterative identification and resolving cycle to optimize application performance:


    The 50x, 10x, 5x, and 2x numbers in the figure below are potential slowdown factors that Arm has observed in real-world applications when that aspect of performance is handled incorrectly.

    Figure: Iterative optimization cycle.

    For more information, see the Arm Forge User Guide.
