Optimize your application

To optimize your application:

  1. Start by compiling your application with the -mcpu=native and -O3 compiler options. Consider using the -Ofast option.

    For C code, try compiling your application with the -fsimdmath option.

    Notes:

    • The -Ofast option enables all the optimizations from -O3, but also performs other aggressive optimizations that might violate strict compliance with language standards.
    • If your Fortran application runs into issues with -Ofast, try -Ofast -fno-stack-arrays, which forces automatic arrays onto the heap.
    • If -Ofast is not acceptable because the reordering of math operations produces wrong results, use -O3 -ffp-contract=fast.
    • If -ffp-contract=fast does not produce the correct results, then use -O3.
    • For a full list of compiler options, see the Arm C/C++ Compiler reference guide and Arm Fortran Compiler reference guide.
  2. Use the optimized Arm Performance Libraries math functions with -armpl.

    Arm Performance Libraries provide optimized standard core math libraries for high-performance computing applications on Arm processors:

    • BLAS - Basic Linear Algebra Subprograms (including XBLAS, the extended precision BLAS).
    • LAPACK - a comprehensive package of higher level linear algebra routines.
    • FFT - a set of Fast Fourier Transform routines for real and complex data.
    • Math Routines - optimized implementations of common math intrinsics.
    • Auto-vectorization - automatic vectorization of Fortran math intrinsics (disable this with -fno-simdmath).

    Arm Compiler for HPC 19.0 introduces a new compiler option -armpl, which makes these libraries significantly easier to use with a simple interface to select thread-parallelism and architectural tuning. Arm Performance Libraries also provides improved Fortran math intrinsics with auto-vectorization.

    The -armpl and -mcpu options enable the compiler to find appropriate Arm Performance Libraries header files (during compilation) and libraries (during linking). Both options are required for the best results.

    Notes:

    • If your build process compiles and links as two separate steps, add the same -armpl and -mcpu options to both. For more information about using the -armpl option, see the Arm Performance Libraries Getting Started guide.

    • For GCC, you must load the correct environment module for your system and explicitly link to your chosen flavor (lp64/ilp64, mp) using the full library path.

    For more information, see the Arm Performance Libraries product page.

  3. Use Arm Compiler optimization remarks.

    Optimization remarks provide you with information about the choices made by the compiler. They can be used to see which code has been inlined or to understand why a loop has not been vectorized.

    For example, to get actionable information on which loops can, and cannot (including why), be vectorized, pass:

    -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -g

    to armflang, armclang, or armclang++.

    Note: You must include the -g debug option.

    For more information, depending on your application's code, see:

  4. Use the Arm Compiler directives.

    Arm Fortran Compiler supports general-purpose and OpenMP-specific directives:

    • !DIR$ IVDEP - A generic directive to force the compiler to ignore any potential memory dependencies of iterative loops and vectorize the loop.

    • !$OMP SIMD - An OpenMP directive to indicate a loop can be transformed into a SIMD loop.

      Note: -fopenmp must be set. There is currently no support for OMP SIMD clauses.
    • !DIR$ VECTOR ALWAYS - Forces the compiler to vectorize a loop irrespective of any potential performance implications.

      Note: the loop must be vectorizable.
    • !DIR$ NOVECTOR - Disables vectorization of a loop.

    For more information, see the directives section of the Arm Fortran Compiler reference guide.

  5. Optimize by iteration. Use Arm Forge Professional to iteratively debug and profile your ported application.

    Arm Forge is composed of the Arm DDT debugger and the Arm MAP profiler:

    • Use Arm DDT to debug your code to ensure application correctness. You can use it in interactive or non-interactive debugging mode and, optionally, integrate it into your CI workflows.

    • Use Arm MAP to profile your code to measure your application performance. MAP collects a broad set of performance metrics, time classification metrics, specific metrics (for example, MPI call and message rates, I/O data rates, and energy data), and instruction information (hardware counters), to give you a comprehensive understanding of the performance of your code. MAP also supports custom metrics, so you can develop your own set of metrics of interest.

    Use the Arm Forge tools and follow an iterative identification and resolving cycle to optimize application performance:

    Note: The 50x, 10x, 5x, and 2x numbers in the figure below are potential slow down factors that Arm has observed in real-world applications (when that aspect of performance is done incorrectly).

    Figure: Iterative optimization cycle

    For more information, see the Arm Forge User Guide.
