
Optimize

After porting your application to Arm, the next step is to optimize it. The following steps describe how you can use the Arm compilers, debugger, and profiler to further enhance the performance of your application on Arm.

  1. Optimize your code using compiler optimization options:

    1. Enable auto-vectorization with the -O<x> options:

      Arm Compiler for Linux optimization options:

      • -O0 - Minimum optimization for the performance of the compiled binary. Turns off most optimizations. When debugging is enabled, this option generates code that directly corresponds to the source code. Therefore, this might result in a significantly larger image. This is the default optimization level. Auto-vectorization: never.

      • -O1 - Restricted optimization. When debugging is enabled, this option gives the best debug view for the trade-off between image size, performance, and debug. Auto-vectorization: disabled by default.

      • -O2 - High optimization. When debugging is enabled, the debug view might be less satisfactory because the mapping of object code to source code is not always clear. The compiler might perform optimizations that cannot be described by debug information. Auto-vectorization: enabled by default.

      • -O3 - Very high optimization. When debugging is enabled, this option typically gives a poor debug view. Arm recommends debugging at lower optimization levels. Auto-vectorization: enabled by default.

      • -Ofast - Enables all the optimizations from level 3, including those performed with the -ffp-mode=fast armclang option. This level also performs other aggressive optimizations that might violate strict compliance with language standards. Auto-vectorization: enabled by default.

      Note

      • The -Ofast option enables all the optimizations from -O3, but also performs other aggressive optimizations that might violate strict compliance with language standards. If your Fortran application has issues with -Ofast, try -Ofast -fno-stack-arrays to force automatic arrays onto the heap.

      • If -Ofast is not acceptable and produces the wrong results because of the reordering of math operations, use -O3 -ffp-contract=fast. If -ffp-contract=fast does not produce the correct results, then use -O3.

    For a more detailed description of auto-vectorizing your code for Arm Neon, see Compile for Neon with Auto-Vectorization.

    2. For C/C++ code, enable the vectorized implementation of libm math functions using the -fsimdmath option. Combine this with the -O<x> option from the previous step.

      Note

      For Fortran source, vector implementations are used by default when possible, but can be disabled using the -fno-simdmath compiler flag.

    3. Optimize for your hardware. To compile your code for your specific core, use the -mcpu option. Compiling for the specific core lets the compiler optimize with knowledge of the architecture version and microarchitecture implemented on that core.

    In summary, a typical set of compiler optimization options is:

    {armclang|armflang} -fsimdmath -mcpu=native -c -O3 <source>{.c|.f}
    

    For a full list of compiler optimization options, see the Arm C/C++ Compiler reference guide and Arm Fortran Compiler reference guide.
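
    As an illustration of what these options target, the following minimal sketch (the routine and file names are placeholders, not taken from this guide) shows a loop that the auto-vectorizer can turn into SIMD code when built with, for example, armflang -Ofast -mcpu=native -c vec_example.f90:

    subroutine scale_and_decay(n, a, x, y)
      ! A simple elementwise loop with no loop-carried dependencies:
      ! a candidate for auto-vectorization at -O2 and above.
      implicit none
      integer, intent(in) :: n
      real, intent(in)    :: a
      real, intent(in)    :: x(n)
      real, intent(inout) :: y(n)
      integer :: i

      do i = 1, n
        ! The call to exp() can also map to a vectorized math routine
        ! (enabled by default for Fortran source, as noted above).
        y(i) = a * x(i) + y(i) * exp(-x(i))
      end do
    end subroutine scale_and_decay

    Because each iteration is independent, the compiler is free to process several iterations per SIMD instruction.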

  2. Enable optimized Arm Performance Libraries functions.

    Arm Performance Libraries provide optimized standard core math libraries for high-performance computing applications on Arm processors:

    • BLAS - Basic Linear Algebra Subprograms (including XBLAS, the extended precision BLAS).

    • LAPACK - a comprehensive package of higher-level linear algebra routines.

    • FFT - a set of Fast Fourier Transform routines for real and complex data. Arm Performance Libraries support FFTW's Basic, Advanced, Guru, and MPI interfaces.

    Arm Performance Libraries also provides improved Fortran math intrinsics with auto-vectorization.

    To enable Arm Performance Libraries, add the -armpl option to your compile command line. -armpl provides a simple interface to select thread-parallelism and architectural tuning. Combining -armpl with the -mcpu option enables the compiler to find appropriate Arm Performance Libraries header files (during compilation) and libraries (during linking). Arm recommends using both options to achieve the best performance enhancement:

    {armclang|armflang} <options> code_with_math_routines{.c|.f} -armpl=<arg1>,<arg2>...
    

    Note

    • If your build process compiles and links as two separate steps, ensure that you add the same -armpl and -mcpu options to both. For more information about using the -armpl option, see Getting Started with Arm Performance Libraries on the Arm Developer website.

    • For GCC, you need to load the correct environment module for the system and explicitly link to your chosen flavor (lp64/ilp64, mp) with the full library path.

    For a more detailed description about using -armpl, see the Library selection topic.
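
    As a minimal sketch of how existing BLAS calls pick up the optimized library (the program name and matrix sizes are illustrative only), the following Fortran program calls the standard dgemm routine; building it with, for example, armflang -O3 -mcpu=native -armpl dgemm_example.f90 resolves the call to the Arm Performance Libraries implementation:

    program dgemm_example
      ! Multiplies two small matrices with the standard BLAS DGEMM routine.
      ! With -armpl, the dgemm call links against Arm Performance Libraries.
      implicit none
      integer, parameter :: n = 4
      double precision :: a(n,n), b(n,n), c(n,n)

      a = 1.0d0
      b = 2.0d0
      c = 0.0d0

      ! C = alpha * A * B + beta * C
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)

      print *, 'c(1,1) =', c(1,1)   ! expect 8.0 for these inputs
    end program dgemm_example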

  3. Use Arm Compiler Optimization Remarks.

    Optimization Remarks provide you with information about the choices that are made by the compiler. Optimization Remarks can be used to see which code has been inlined or to understand why a loop has not been vectorized.

    To enable Optimization Remarks, pass one or more of the following -Rpass flags at compile time:

    -Rpass flags to enable optimization remarks:

    • -Rpass=<regexp> - To request information about what Arm Compiler has optimized.

    • -Rpass-analysis=<regexp> - To request information about what Arm Compiler has analyzed.

    • -Rpass-missed=<regexp> - To request information about what Arm Compiler failed to optimize.

    In each case, <regexp> selects the type of remarks to provide: for example, loop-vectorize for information on vectorization, inline for information on inlining, or .* to report all Optimization Remarks. -Rpass accepts regular expressions, so (loop-vectorize|inline) can be used to capture any remark on vectorization or inlining.

    For example, to get actionable information on which loops can and cannot be vectorized at compile time, pass:

    -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -g
    

    Note

    • Optimization Remarks are only available when you have set an appropriate debug flag, for example -g.

    • Optimization Remarks are piped to stdout at compile time.

    For more information, see the Optimization Remarks documentation for Fortran or for C/C++.
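
    As a minimal sketch (the routine is illustrative, and the exact wording of the remarks depends on your compiler version), the loop below writes through an index array, so the compiler usually cannot prove that iterations are independent. Compiling it with -O3 -g and the three -Rpass options shown above reports which loops were vectorized, and why this one typically was not:

    subroutine gather_update(n, ind, x, y)
      ! The indirect write through ind(i) may repeat an index between
      ! iterations, so the vectorizer typically reports this loop in the
      ! -Rpass-missed / -Rpass-analysis output rather than vectorizing it.
      implicit none
      integer, intent(in) :: n
      integer, intent(in) :: ind(n)
      real, intent(in)    :: x(n)
      real, intent(inout) :: y(n)
      integer :: i

      do i = 1, n
        y(ind(i)) = y(ind(i)) + x(i)
      end do
    end subroutine gather_update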

  4. Use Arm Optimization Report.

    Arm Optimization Report is a new feature of Arm Compiler for Linux that builds upon the llvm-opt-report tool available in open-source LLVM. The new Arm Optimization Report feature makes it easier to see what optimization decisions the compiler is making about unrolling, vectorizing, and interleaving, all in-line with your source code.

    To enable Arm Optimization Report:

    1. At compile time, add the -fsave-optimization-record option to the command line.

      A <filename>.opt.yaml report is generated by the compiler, where <filename> is the name of the binary.

    2. Use Arm Optimization Report (arm-opt-report) to inspect the <filename>.opt.yaml report as augmented source code:

      arm-opt-report <filename>.opt.yaml
      

      The annotated source code appears in the terminal.
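
    For example, assuming a Fortran source file named example.f90 (an illustrative name), the two steps might look like this, with example.opt.yaml generated alongside the object file:

    armflang -c -O3 -fsave-optimization-record example.f90
    arm-opt-report example.opt.yaml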

  5. Use the Arm Compiler directives.

    Arm Fortran Compiler supports general-purpose and OpenMP-specific directives:

    • !DIR$ IVDEP - A generic directive to force the compiler to ignore any potential memory dependencies of iterative loops and vectorize the loop.

    • !$OMP SIMD - An OpenMP directive to indicate that a loop can be transformed into a SIMD loop.

    • !DIR$ VECTOR ALWAYS - Forces the compiler to vectorize a loop regardless of any potential performance implications.

      Note

      The loop must be vectorizable.

    • !DIR$ NO VECTOR - Disables vectorization of a loop.

    • !DIR$ UNROLL - Instructs the compiler to unroll the loop it precedes.

    • !DIR$ NOUNROLL - Instructs the compiler not to unroll the loop it precedes.

    For more information, see the directives section of the Arm Fortran Compiler reference guide.
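
    As a minimal sketch of how these directives are placed (the routine and array names are illustrative), each directive is written on the line immediately before the loop it controls:

    subroutine directive_examples(n, ix, a, b)
      implicit none
      integer, intent(in) :: n
      integer, intent(in) :: ix(n)
      real, intent(inout) :: a(n)
      real, intent(in)    :: b(n)
      integer :: i

      ! The compiler cannot prove that ix(i) never repeats, so it assumes a
      ! dependency; IVDEP asserts that there is none and allows vectorization.
      !DIR$ IVDEP
      do i = 1, n
        a(ix(i)) = a(ix(i)) + b(i)
      end do

      ! OpenMP directive marking the loop as a SIMD loop
      ! (requires an OpenMP-enabled compile, for example with -fopenmp).
      !$OMP SIMD
      do i = 1, n
        a(i) = a(i) + b(i)
      end do

      ! Ask the compiler not to unroll this loop.
      !DIR$ NOUNROLL
      do i = 1, n
        a(i) = 2.0 * a(i)
      end do
    end subroutine directive_examples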

  6. Optimize by iteration.

    Use Arm Performance Reports to characterize and understand the performance of HPC application runs. Use Arm Forge to debug and profile your ported application.

    Arm Forge is composed of the Arm DDT debugger and the Arm MAP profiler:

    • Use Arm DDT to debug your code to ensure that the application is correct. Arm DDT can be used both in an interactive and non-interactive debugging mode, and optionally, integrated into your CI workflows.

    • Use Arm MAP to profile your code to measure your application performance. To understand the code performance, Arm MAP collects:

      • A broad set of performance metrics.

      • A broad set of time classification metrics.

      • Instruction information (hardware counters).

      • Specific metrics (for example MPI call and message rates, I/O data rates, and energy data).

      • Custom metrics (metrics defined by you).

    To optimize application performance, use the Arm Performance Reports and Arm Forge tools and follow an iterative identification and resolving cycle:

    1. Run your code on real workloads and generate a performance report with Arm Performance Reports.

    2. Use Arm Forge to examine the I/O, and to trace and debug suspicious or slow access patterns.

      Common problems include:

      • Checkpointing too often.

      • Many small reads and writes.

      • Using your home directory instead of the scratch directory.

      • Multiple nodes using the filesystem at the same time.

    3. Use Arm Performance Reports to identify the workload balance of your application, then use Arm Forge to dive into partitioning code.

      Common problems include:

      • Your dataset is too small to efficiently run at scale.

      • I/O contention causing late senders.

      • Partitioning code bugs.

    4. Use Arm Forge to identify lines of code that are causing memory access pattern problems.

      Common problems include:

      • Initializing memory on one core but using it on a different core.

      • Arrays of structures causing inefficient cache utilization (see the layout sketch at the end of this list).

      • Caching results when re-computation is more efficient.

    5. Use Arm Performance Reports to track communication performance, and Arm Forge to see which communication calls are slow and why.

      Common problems include:

      • Short, high-frequency messages are very sensitive to latency.

      • Too many synchronizations.

      • No overlap between communication and computation.

    6. Use Arm Performance Reports to observe the core utilization and synchronization overhead, then use Arm Forge to identify the corresponding code.

      Common problems include:

      • Implicit thread barriers inside tight loops.

      • Significant core idle time because of workload imbalance.

      • Threads migrating between cores at runtime.

    7. Use Arm Performance Reports to observe the numerical intensity and level of vectorization, and Arm Forge to identify the hot loops and unvectorized code.

      Common problems include:

      • Sub-optimal compiler options for your system.

      • Numerically intensive loops with hard-to-vectorize patterns.

      • Not utilizing highly-optimized math libraries.
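
    For example, the arrays-of-structures problem listed under the memory access pattern step can often be addressed by moving to a structure-of-arrays layout. The following minimal sketch (the type, routine, and field names are illustrative) contrasts the two: the first routine drags a whole particle through the cache to update one field, while the second streams the same update through a contiguous array:

    module particle_layouts
      implicit none

      ! Array-of-structures layout: each iteration loads the whole
      ! particle even though only x is updated, wasting cache bandwidth.
      type :: particle
        real :: x, y, z, mass
      end type particle

    contains

      subroutine push_aos(n, dt, p, v)
        integer, intent(in) :: n
        real, intent(in) :: dt, v(n)
        type(particle), intent(inout) :: p(n)
        integer :: i
        do i = 1, n
          p(i)%x = p(i)%x + dt * v(i)
        end do
      end subroutine push_aos

      ! Structure-of-arrays layout: the same update now streams through a
      ! contiguous array, which uses the caches and vector units far more
      ! efficiently.
      subroutine push_soa(n, dt, x, v)
        integer, intent(in) :: n
        real, intent(in) :: dt, v(n)
        real, intent(inout) :: x(n)
        integer :: i
        do i = 1, n
          x(i) = x(i) + dt * v(i)
        end do
      end subroutine push_soa

    end module particle_layouts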

    Example slowdown factors that can occur at each stage are shown in the following figure:

    [Figure: Iterative optimization cycle, with example slowdown factors.]

    Note

    The 50x, 10x, 5x, and 2x numbers in the figure are potential slowdown factors that Arm has observed in real-world applications (when that aspect of performance is done incorrectly).

    For more information about using analysis, debugging, and profiling tools, see the Arm Performance Reports and Arm Forge user guides.
