Overview

Most applications port to the Arm architecture with little or no modification:

  • Arm is supported by all major Linux distributions, which provide a rich library of common Linux packages built for AArch64.

  • GCC is fully supported.

  • Arm Compiler accepts GCC compiler options wherever possible.

A few features of the Arm architecture may affect your application. These are detailed under 'Troubleshooting' in the 'Port your application' section.


Port your application

To port your application, follow these steps:

Note: If you encounter any issues with your build, see the Port your application - Troubleshooting section below.

  1. Ensure all your application dependencies have been ported.

    Use of external libraries is increasingly common, and a conscious design choice for many projects. Common dependencies include:

    • IO libraries.
      For example, HDF5 and NetCDF (C, parallel, and Fortran flavors).

    • Linear Solvers.
      For example, PETSc, HYPRE, Trilinos, BLAS, LAPACK, and ScaLAPACK routines.

    • Fast Fourier transforms.
      For example, FFTW.

    • Communication layers, or execution environments.
      For example, Open MPI, OpenUCX, and Charm++.

    • Libraries providing performance portability and memory abstraction.
      For example, Kokkos and RAJA.

    In most cases, you'll find that these dependencies have already been built on Arm, with both the Arm and GNU toolchains.

  2. Check you are using the correct compiler.

    During your build configuration, choose the C, C++ and Fortran compilers to use. For example, for the Arm Compiler you would typically set:

    CC=armclang 
    CXX=armclang++
    FC=armflang
    F77=armflang

    For GNU compilers:

    CC=gcc
    CXX=g++
    FC=gfortran
    F77=gfortran

    Note: For MPI builds (for example, Open MPI), you might need to use the MPI wrappers. These are usually the same for all compilers:

    CC=mpicc
    CXX=mpicxx
    FC=mpifort

  3. Check you are using the right compiler options.

    Most GCC options are supported by the Arm Compiler. In addition to any other options, use -mcpu=native (for the Arm Compiler) or -march=native (for GCC) so that the compiled code is tuned for the microarchitecture of your machine.
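
One way to check what a tuning option such as -mcpu=native or -march=native actually enabled is to inspect the compiler's predefined macros from inside the code. The following C sketch is not part of the original guide: struct target_info and describe_target are hypothetical names chosen for illustration, while __aarch64__, __ARM_NEON, and __ARM_FEATURE_SVE are predefined macros set by GCC and Clang-based compilers (the Arm Compiler is assumed to behave the same way because it is Clang-based).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of what the compiler was told to target. */
struct target_info {
    const char *arch; /* "aarch64", "x86_64", or "other" */
    bool neon;        /* Advanced SIMD code generation enabled */
    bool sve;         /* SVE code generation enabled */
};

/* Fill in *info from the compiler's predefined macros. The answer is
 * fixed at compile time; it describes the build, not the machine the
 * binary happens to run on. */
void describe_target(struct target_info *info)
{
    assert(info != NULL);

#if defined(__aarch64__)
    info->arch = "aarch64";
#elif defined(__x86_64__)
    info->arch = "x86_64";
#else
    info->arch = "other";
#endif

#if defined(__ARM_NEON)
    info->neon = true;
#else
    info->neon = false;
#endif

#if defined(__ARM_FEATURE_SVE)
    info->sve = true;
#else
    info->sve = false;
#endif
}
```

Printing the fields from a build made with and without the tuning option gives a quick indication of whether, for example, SVE code generation was switched on.
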

  4. Build your application as you would normally.

  5. Run your test suite.

    Warning: Regression tests that rely on bit-wise identical answers might not be portable between architectures.
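
A common way to make such regression tests portable is to compare results against a tolerance rather than bit for bit. The following is a minimal C sketch under stated assumptions: nearly_equal and magnitude are hypothetical helper names, and suitable tolerance values depend entirely on your application and are not prescribed by this guide.

```c
#include <assert.h>
#include <math.h>
#include <stdbool.h>

/* Absolute value without calling into the math library. */
static double magnitude(double x)
{
    return x < 0.0 ? -x : x;
}

/* Returns true when `actual` matches `expected` within the larger of
 * an absolute tolerance and a relative tolerance scaled by the bigger
 * operand. Both tolerances must be non-negative. */
bool nearly_equal(double expected, double actual, double rel_tol, double abs_tol)
{
    assert(rel_tol >= 0.0 && abs_tol >= 0.0);

    /* NaN never matches anything, including another NaN. */
    if (isnan(expected) || isnan(actual)) {
        return false;
    }
    /* Exact equality also covers two infinities of the same sign. */
    if (expected == actual) {
        return true;
    }
    /* Any remaining infinity cannot be within tolerance of the other value. */
    if (isinf(expected) || isinf(actual)) {
        return false;
    }

    double diff = magnitude(expected - actual);
    double larger = magnitude(expected) > magnitude(actual)
                        ? magnitude(expected)
                        : magnitude(actual);
    double allowed = rel_tol * larger;
    if (allowed < abs_tol) {
        allowed = abs_tol;
    }
    return diff <= allowed;
}
```

The absolute tolerance matters for values near zero, where a purely relative check would reject harmless rounding differences.
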

Port your application - Troubleshooting

Here are some problems you might encounter while porting your application:

  1. Configure is unable to identify your platform.

    This may be due to the config.guess supplied with the application being out of date. This can also be true for a config.guess already installed on your system and used by some configure scripts.

    To fix this problem, obtain up-to-date versions:

    wget 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD' -O config.guess
    wget 'http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD' -O config.sub

  2. Libtool fails to link Fortran applications or interfaces.

    Libtool does not recognize the Arm Compiler as a Fortran compiler. Therefore, it is unable to set the correct flags for linking the binary.

    To ensure Libtool uses the correct compiler options with Arm Compiler, modify it (after running configure), using:

    sed -i -e 's#wl=""#wl="-Wl,"#g' libtool
    sed -i -e 's#pic_flag=""#pic_flag=" -fPIC -DPIC"#g' libtool

    Some widely used applications and libraries, for example Open MPI, have already incorporated a patch for Libtool to address this issue.

  3. There are compiler-dependent #ifdefs in the Makefile which are not being set.

    You might need to update the source to use the __FLANG and __clang__ macros, or manually set existing compiler macros, for example with -D__PGI.
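
For C sources, compiler detection can be centralized in one place so that the rest of the code does not depend on individual vendor macros. The sketch below is illustrative only: APP_COMPILER_NAME and compiler_family are hypothetical names, and the example relies on the standard predefined macros __clang__ (Clang-based compilers, assumed here to include armclang) and __GNUC__ (GCC).

```c
#include <assert.h>

/* Map the compiler's own predefined macros onto one project-level name.
 * __clang__ is tested first because Clang-based compilers also define
 * __GNUC__ for compatibility. */
#if defined(__clang__)
#  define APP_COMPILER_NAME "clang"
#elif defined(__GNUC__)
#  define APP_COMPILER_NAME "gcc"
#else
#  define APP_COMPILER_NAME "unknown"
#endif

/* Returns the detected compiler family as a string. */
const char *compiler_family(void)
{
    const char *name = APP_COMPILER_NAME;
    assert(name[0] != '\0'); /* every branch above defines a non-empty name */
    return name;
}
```

Code that previously tested a vendor-specific macro directly can test the project-level macro instead, which leaves a single block to update when a new compiler is added.
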

  4. Your code might be making use of language features which are not currently supported by the Arm Compiler.

    Check the support status of the compiler, for:

  5. Are you experiencing a race condition you have not encountered before?

    AArch64 adopts a weak memory model. This means that reads and writes can be reordered. In some cases, explicit memory barriers are needed on AArch64 that were not required on other architectures.
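
A typical case is one thread publishing data through a flag that another thread polls. With plain variables, the flag can become visible before the data on a weakly ordered machine. A portable way to express the required ordering is C11 release/acquire atomics, which leave it to the compiler to emit whatever barriers or ordered instructions the target needs. The following is a minimal sketch, not code from this guide: struct mailbox and its functions are hypothetical names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical one-slot mailbox shared by a producer and a consumer thread. */
struct mailbox {
    int payload;       /* plain data, written before `ready` is set */
    atomic_bool ready; /* publication flag */
};

void mailbox_init(struct mailbox *m)
{
    assert(m != NULL);
    m->payload = 0;
    atomic_init(&m->ready, false);
}

/* Producer side: the release store prevents the payload write from
 * being reordered after the flag write. */
void mailbox_publish(struct mailbox *m, int value)
{
    assert(m != NULL);
    m->payload = value;
    atomic_store_explicit(&m->ready, true, memory_order_release);
}

/* Consumer side: the acquire load prevents the payload read from being
 * reordered before the flag read. Returns true and stores the payload
 * in *out if a value has been published, false otherwise. */
bool mailbox_try_consume(struct mailbox *m, int *out)
{
    assert(m != NULL && out != NULL);
    if (!atomic_load_explicit(&m->ready, memory_order_acquire)) {
        return false;
    }
    *out = m->payload;
    return true;
}
```

If the flag were an ordinary bool, the code might appear to work on a more strongly ordered architecture and fail only intermittently after porting, which matches the symptom described above.
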

  6. Do you have an integer divide by zero?

    On AArch64, an integer divide by zero does not generate an error. Instead, the result is zero.

    Note: this is not the case for floating-point divide by zero.

    On rare occasions, an undetected divide by zero might allow an application to continue running with erroneous results when it should have crashed. It might be necessary to catch attempted divide-by-zero operations in software.
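
Catching a divide by zero in software can be as simple as routing integer divisions through a checked helper. The sketch below is one possible approach in C, with checked_div as a hypothetical name. Note that integer division by zero is undefined behavior in C regardless of the hardware, so an explicit check is the only portable option.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/* Divides numerator by denominator without relying on a hardware trap.
 * Returns true and stores the result in *quotient on success.
 * Returns false and leaves *quotient untouched when the division is
 * undefined: a zero denominator, or INT_MIN / -1, which overflows. */
bool checked_div(int numerator, int denominator, int *quotient)
{
    assert(quotient != NULL);

    if (denominator == 0) {
        return false;
    }
    if (numerator == INT_MIN && denominator == -1) {
        return false;
    }
    *quotient = numerator / denominator;
    return true;
}
```

The caller decides how to react to a false return, for example by logging and aborting, so that the failure is as visible on AArch64 as a trap would be elsewhere.
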

Optimize your application

To optimize your application:

  1. Start by compiling your application with the -mcpu=native and -O3 compiler options. Consider using the -Ofast option.

    For C code, try compiling your application with the -fsimdmath option.

    Notes:

    • The -Ofast option enables all the optimizations from -O3, but also performs other aggressive optimizations that might violate strict compliance with language standards.
    • If your Fortran application runs into issues with -Ofast, try -Ofast -fno-stack-arrays to force automatic arrays onto the heap.
    • If -Ofast is not acceptable because the reordering of math operations produces wrong results, use -O3 -ffp-contract=fast.
    • If -ffp-contract=fast does not produce the correct results, then use -O3.
    • For a full list of compiler options, see the Arm C/C++ Compiler reference guide and Arm Fortran Compiler reference guide.
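
The sensitivity to operation order mentioned in these notes comes from floating-point addition not being associative. The C sketch below illustrates the effect at source level by summing the same array in two different orders. It only demonstrates the arithmetic; it is not a description of the specific transformations the compiler performs. The function names sum_forward and sum_backward are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Adds the elements from first to last. */
double sum_forward(const double *values, size_t n)
{
    assert(values != NULL || n == 0);
    double total = 0.0;
    for (size_t i = 0; i < n; ++i) {
        total += values[i];
    }
    return total;
}

/* Adds the same elements from last to first. */
double sum_backward(const double *values, size_t n)
{
    assert(values != NULL || n == 0);
    double total = 0.0;
    for (size_t i = n; i > 0; --i) {
        total += values[i - 1];
    }
    return total;
}
```

For the input {1.0, 1e30, -1e30}, and assuming the file is built without value-changing options, sum_forward returns 0.0 because the 1.0 is absorbed when it is added to 1e30, whereas sum_backward returns 1.0 because the two large values cancel first. An optimization level that permits reordering can move a result from one of these answers to the other.
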
  2. Use the optimized Arm Performance Libraries math functions with -armpl.

    Arm Performance Libraries provide optimized standard core math libraries for high-performance computing applications on Arm processors:

    • BLAS - Basic Linear Algebra Subprograms (including XBLAS, the extended precision BLAS).
    • LAPACK - a comprehensive package of higher level linear algebra routines.
    • FFT - a set of Fast Fourier Transform routines for real and complex data.
    • Math Routines - Optimized implementations of common math intrinsics.
    • Auto-vectorization - of Fortran math intrinsics (disable this with -fno-simdmath).

    Arm Compiler for HPC 19.0 introduces a new compiler option, -armpl, which makes these libraries significantly easier to use by providing a simple interface to select thread-parallelism and architectural tuning. Arm Performance Libraries also provides improved Fortran math intrinsics with auto-vectorization.

    The -armpl and -mcpu options enable the compiler to find appropriate Arm Performance Libraries header files (during compilation) and libraries (during linking). Both options are required for the best results.

    Notes:

    • If your build process compiles and links as two separate steps, add the same -armpl and -mcpu options to both. For more information about using the -armpl option, see the Arm Performance Libraries Getting Started guide.

    • For GCC, you need to load the correct environment module for your system and explicitly link to your chosen flavor (lp64/ilp64, mp) using the full library path.

    For more information, see the Arm Performance Libraries product page.

  3. Use Arm Compiler optimization remarks.

    Optimization remarks provide you with information about the choices made by the compiler. They can be used to see which code has been inlined or to understand why a loop has not been vectorized.

    For example, to get actionable information on which loops can, and cannot (including why), be vectorized, pass:

    -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -g

    to armflang, armclang, or armclang++.

    Note: You must include the -g debug option.
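
To see the remarks in practice, it can help to compile a small file containing one loop that is a good vectorization candidate and one that is not. The following C sketch is an illustration, not taken from this guide: scale_add and prefix_sum are hypothetical functions. Whether a given loop is actually vectorized, and the exact wording of the remarks, depends on the compiler version and target, so no output is shown here.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: `restrict` promises that x and y do not
 * overlap, which makes this loop a candidate for vectorization. */
void scale_add(size_t n, double alpha, const double *restrict x, double *restrict y)
{
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; ++i) {
        y[i] += alpha * x[i];
    }
}

/* Loop-carried dependence: each iteration reads the element written by
 * the previous one, which normally prevents straightforward
 * vectorization and should show up in the missed or analysis remarks. */
void prefix_sum(size_t n, double *a)
{
    assert(n == 0 || a != NULL);
    for (size_t i = 1; i < n; ++i) {
        a[i] += a[i - 1];
    }
}
```

Saving this as a file (for example, a hypothetical loops.c) and compiling it with armclang, -O3, and the remark options listed above gives a small, controlled case against which to read the remarks before applying them to a full application.
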

    For more information, depending on your application code, see:

  4. Use the Arm Compiler directives.

    Arm Fortran Compiler supports general-purpose and OpenMP-specific directives:

    • !DIR$ IVDEP - A generic directive to force the compiler to ignore any potential memory dependencies of iterative loops and vectorize the loop.

    • !$OMP SIMD - An OpenMP directive to indicate a loop can be transformed into a SIMD loop.

      Note: -fopenmp must be set. There is currently no support for OMP SIMD clauses.

    • !DIR$ VECTOR ALWAYS - Forces the compiler to vectorize a loop irrespective of any potential performance implications.

      Note: The loop must be vectorizable.

    • !DIR$ NO VECTOR - Disables vectorization of a loop.

    For more information, see the directives section of the Arm Fortran Compiler reference guide.

  5. Optimize by iteration. Use Arm Forge Professional to iteratively debug and profile your ported application.

    Arm Forge is composed of the Arm DDT debugger and the Arm MAP profiler:

    • Use Arm DDT to debug your code to ensure application correctness. It can be used in both interactive and non-interactive debugging modes and, optionally, integrated into your CI workflows.

    • Use Arm MAP to profile your code and measure your application's performance. MAP collects a broad set of performance metrics, time classification metrics, specific metrics (for example, MPI call and message rates, I/O data rates, and energy data), and instruction information (hardware counters) to give a comprehensive understanding of the performance of your code. MAP also supports custom metrics, so you can develop your own set of metrics of interest.

    Use the Arm Forge tools and follow an iterative cycle of identifying and resolving issues to optimize application performance:

    Note: The 50x, 10x, 5x, and 2x numbers in the figure below are potential slow down factors that Arm has observed in real-world applications (when that aspect of performance is done incorrectly).

    Figure: Iterative optimization cycle

    For more information, see the Arm Forge User Guide.