Optimize your application
To optimize your application:
Start by compiling your application with the -O3 compiler options. Consider using the
- For C code, try compiling your application with the
The -Ofast option enables all the optimizations from -O3, but also performs other aggressive optimizations that might violate strict compliance with language standards.
- If your Fortran application runs into issues with -Ofast, to force automatic arrays on the heap, try
- If -Ofast is not acceptable and produces the wrong results because of the reordering of math operations, use
- If -ffp-contract=fast does not produce the correct results, then use
- For a full list of compiler options, see the Arm C/C++ Compiler reference guide and Arm Fortran Compiler reference guide.
Use the optimized Arm Performance Libraries math functions with
Arm Performance Libraries provide optimized standard core math libraries for high-performance computing applications on Arm processors:
- BLAS - Basic Linear Algebra Subprograms (including XBLAS, the extended precision BLAS).
- LAPACK - a comprehensive package of higher level linear algebra routines.
- FFT - a set of Fast Fourier Transform routines for real and complex data.
- Math Routines - Optimized implementations of common maths intrinsics.
- Auto-vectorization - of Fortran math intrinsics (disable this with -fno-simdmath).
Arm Compiler for HPC 19.0 introduces a new compiler option, -armpl, which makes these libraries significantly easier to use with a simple interface to select thread-parallelism and architectural tuning. Arm Performance Libraries also provides improved Fortran math intrinsics with auto-vectorization.
The -armpl and -mcpu options enable the compiler to find appropriate Arm Performance Libraries header files (during compilation) and libraries (during linking). Both options are required for the best results.
If your build process compiles and links as two separate steps, ensure that you add the same -armpl and -mcpu options to both. For more information about using the -armpl option, see the Arm Performance Libraries Getting Started guide.
For GCC, you need to load the correct environment module for your system and explicitly link to your chosen flavour (lp64/ilp64, mp) using the full library path.
For more information, see the Arm Performance Libraries product page.
Use Arm Compiler optimization remarks.
Optimization remarks provide you with information about the choices made by the compiler. They can be used to see which code has been inlined or to understand why a loop has not been vectorized.
For example, to get actionable information on which loops can, and cannot (including why), be vectorized, pass:
-Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -g
to armclang++.
Note: You must include the
For more information, depending on your application code, see:
Use the Arm Compiler directives.
Arm Fortran Compiler supports general-purpose and OpenMP-specific directives:
- !DIR$ IVDEP - A generic directive to force the compiler to ignore any potential memory dependencies of iterative loops and vectorize the loop.
- !$OMP SIMD - An OpenMP directive to indicate a loop can be transformed into a SIMD loop.
- !DIR$ VECTOR ALWAYS - Forces the compiler to vectorize a loop irrespective of any potential performance implications.
  Note: The loop must be vectorizable.
- !DIR$ NOVECTOR - Disables vectorization of a loop.
Note: -fopenmp must be set. There is currently no support for OMP SIMD clauses.
For more information, see the directives section of the Arm Fortran Compiler reference guide.
Optimize by iteration. Use Arm Forge Professional to iteratively debug and profile your ported application.
Arm Forge is composed of the Arm DDT debugger and the Arm MAP profiler:
Use Arm DDT to debug your code to ensure application correctness. It can be used in both interactive and non-interactive debugging modes and, optionally, integrated into your CI workflows.
Use Arm MAP to profile your code to measure your application performance. MAP collects a broad set of performance metrics, time classification metrics, specific metrics (for example MPI call and message rates, I/O data rates, energy data), and instruction information (hardware counters), to ensure a comprehensive understanding of the performance of your code. MAP also supports custom metrics so you can develop your own set of metrics of interest.
Use the Arm Forge tools and follow an iterative identification and resolving cycle to optimize application performance:
Note: The 50x, 10x, 5x, and 2x numbers in the figure below are potential slowdown factors that Arm has observed in real-world applications (when that aspect of performance is done incorrectly).
For more information, see the Arm Forge User Guide.