Getting started with Arm C/C++ Compiler

Arm C/C++ Compiler is an auto-vectorizing compiler for the 64-bit Armv8-A architecture, with optional support for the Scalable Vector Extension (SVE). This tutorial shows how to compile C and C++ code and generate executables that run on any 64-bit Armv8-A based system.

Installing Arm Compiler for HPC

Refer to Installing Arm Compiler for HPC for details on how to perform the installation on Linux.

Environment Configuration

Note: Full instructions on configuring your environment for Arm Compiler for HPC are included in the installation guide.

Your administrator should have already installed Arm Compiler for HPC and made the environment module available.

To see which environment modules are available:

module avail

Note: you may need to configure the MODULEPATH environment variable to include the installation directory:

export MODULEPATH=$MODULEPATH:/opt/arm/modulefiles/

To configure your Linux environment to make Arm Compiler for HPC available:

module load <architecture>/<linux_variant>/<linux_version>/suites/arm-compiler-for-hpc/<version>

For example:

module load Generic-AArch64/SUSE/12/suites/arm-compiler-for-hpc/18.1

You can check your environment by examining the PATH variable. It should contain the appropriate bin directory from /opt/arm, as installed in the previous section:

% echo $PATH 
/opt/arm/arm-compiler-for-hpc-18.1_Generic-AArch64_SUSE-12_aarch64-linux/bin:...

You can also use the which command to check that the Arm C/C++ Compiler armclang command is available:

% which armclang
/opt/arm/arm-compiler-for-hpc-18.1_Generic-AArch64_SUSE-12_aarch64-linux/bin/armclang

Note: You might want to consider adding the module load command to your .profile to run it automatically every time you log in.

Compiling and running a simple 'Hello World' program

This example shows how to compile and run a Hello World program.

  1. Create a simple "Hello World" program and save it in a file. In our case, we have saved it in a file named hello.c.

    /* Hello World */

    #include <stdio.h>

    int main(void)
    {
        printf("Hello World\n");
        return 0;
    }

  2. To generate an executable binary, compile your program with Arm C/C++ Compiler.

    armclang -o hello hello.c

  3. Now you can run the generated binary hello as shown below:

    ./hello

In the following sections we discuss the available compiler options in more detail and, towards the end of this tutorial, illustrate using them with a more advanced example.

Generating executable binaries from C and C++ code

To generate an executable binary, compile a program using:

armclang -o example1 example1.c

You can also specify multiple source files on a single line. Each source file is compiled individually and then linked into a single executable binary:

armclang -o example1 example1a.c example1b.c

Compiling and linking object files as separate steps

To compile each of your source files individually into an object file, specify the -c (compile-only) option, and then pass the resulting object files into another invocation of armclang to link them into an executable binary.

armclang -c -o example1a.o example1a.c
armclang -c -o example1b.o example1b.c
armclang -o example1 example1a.o example1b.o

Increasing the optimization level

To increase the optimization level, use the -Olevel option. The -O0 option is the lowest optimization level, while -O3 is the highest. Arm C/C++ Compiler only performs auto-vectorization at -O2 and higher, and uses -O0 as the default setting. The optimization flag can be specified when generating a binary, such as:

armclang -O3 -o example1 example1.c

The optimization flag can also be specified when generating an object file:

armclang -O3 -c -o example1a.o example1a.c
armclang -O3 -c -o example1b.o example1b.c

or when linking object files:

armclang -O3 -o example1 example1a.o example1b.o
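
One way to check whether a source file was built with optimization enabled is to test the __OPTIMIZE__ predefined macro, which Clang-based compilers define at -O1 and higher. The following is a minimal sketch that assumes armclang follows Clang's behavior here. The function name optimization_enabled is illustrative and is not part of any Arm API.

```c
/* Returns 1 if this translation unit was compiled with optimization
 * enabled, 0 otherwise.
 * Assumption: the compiler defines __OPTIMIZE__ at -O1 and higher, as
 * Clang and GCC do. */
int optimization_enabled(void)
{
#ifdef __OPTIMIZE__
    return 1;
#else
    return 0;
#endif
}
```

Under that assumption, a program that calls this function would be expected to see 0 when built with the default settings and 1 when built with -O3.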

Compiling and optimizing using CPU auto-detection

Arm C/C++ Compiler supports the use of the -mcpu=native option, for example:

armclang -O3 -mcpu=native -o example1 example1.c

This option enables the compiler to automatically detect the architecture and processor type of the CPU it is being run on, and optimize accordingly.

This option supports a range of Armv8-A based SoCs, including ThunderX2.

Note: The optimizations enabled by CPU auto-detection are independent of the optimization level set with the -Olevel option.
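
To see which target features a given set of options has enabled, you can inspect the feature macros that the compiler predefines. The sketch below relies on the __aarch64__, __ARM_NEON, and __ARM_FEATURE_SVE macros from the Arm C Language Extensions. The function name describe_target and the format of its output are illustrative assumptions, and the values reported depend on the machine and options used.

```c
#include <stdio.h>

/* Writes a one-line summary of the target features visible to the
 * preprocessor into buf (at most n bytes, including the terminating NUL).
 * Returns the value returned by snprintf.
 * The function name and output format are illustrative only. */
int describe_target(char *buf, size_t n)
{
#if defined(__aarch64__)
    const char *arch = "AArch64";
#else
    const char *arch = "not AArch64";
#endif
#if defined(__ARM_NEON)
    const char *neon = "yes";
#else
    const char *neon = "no";
#endif
#if defined(__ARM_FEATURE_SVE)
    const char *sve = "yes";
#else
    const char *sve = "no";
#endif
    return snprintf(buf, n, "arch=%s neon=%s sve=%s", arch, neon, sve);
}
```

Printing the resulting string from programs built with and without -mcpu=native gives a quick way to compare what the compiler was told about the target.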

Advanced example: Generating Arm assembly code from C and C++ code

Arm C/C++ Compiler can produce annotated assembly, and this is a good first step to see how the compiler vectorizes loops.

Note: Different compiler options are required to make use of SVE functionality. If you are using SVE, please refer to Compiling C/C++ code for Arm SVE architectures.

Example

The following C program subtracts corresponding elements of two arrays, writing the result to a third array. The three pointer parameters of subtract_arrays are declared using the restrict keyword, indicating to the compiler that the arrays they point to do not overlap in memory.

// example1.c
#define ARRAYSIZE 1024
int a[ARRAYSIZE];
int b[ARRAYSIZE];
int c[ARRAYSIZE];
void subtract_arrays(int *restrict a, int *restrict b, int *restrict c)
{
    for (int i = 0; i < ARRAYSIZE; i++)
    {
        a[i] = b[i] - c[i];
    }
}

int main()
{
    subtract_arrays(a, b, c);
}
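
As an aside on the restrict qualifier used in subtract_arrays, the following sketch shows what the compiler has to allow for when restrict is absent. The names subtract_may_alias and overlapping_call_demo are hypothetical and are not part of example1.c. When the destination overlaps a source, each iteration can depend on the value written by the previous one, so the loop cannot simply be executed several elements at a time.

```c
#include <stddef.h>

/* Same loop as subtract_arrays, but without restrict: the compiler must
 * assume that a, b and c might overlap in memory. */
void subtract_may_alias(int *a, const int *b, const int *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] - c[i];
}

/* Hypothetical helper: calls the loop with a destination that deliberately
 * overlaps the first source (a == b + 1), so iteration i reads the element
 * written by iteration i - 1. Returns the last element written. */
int overlapping_call_demo(void)
{
    int b[5] = {10, 0, 0, 0, 0};
    int c[4] = {1, 1, 1, 1};

    subtract_may_alias(b + 1, b, c, 4);   /* b becomes {10, 9, 8, 7, 6} */
    return b[4];
}
```

Declaring the pointers restrict, as example1.c does, is a promise that this kind of overlap never occurs, which leaves the compiler free to load and subtract several elements at once.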

Compile the program as follows:

armclang -O1 -S -o example1.s example1.c

The -S flag outputs assembly code. The output is saved as example1.s. The section of the generated assembly language file containing the compiled subtract_arrays function appears as follows:

subtract_arrays:                        // @subtract_arrays
// BB#0:
        mov     x8, xzr
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        ldr     w9, [x1, x8]
        ldr     w10, [x2, x8]
        sub     w9, w9, w10
        str     w9, [x0, x8]
        add     x8, x8, #4              // =4
        cmp     x8, #1, lsl #12         // =4096
        b.ne    .LBB0_1
// BB#2:
        ret

This code shows that the compiler has not performed any vectorization, because we specified the -O1 (low optimization) option. Array elements are processed one at a time. Each array element is a 32-bit (4-byte) integer, so the byte offset held in x8 increases by 4 on each iteration. The loop stops when the offset reaches 4096, the end of the array (1024 elements * 4 bytes).
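
To make the correspondence with the assembly easier to follow, the scalar loop can be written in C using an explicit byte offset, mirroring the role of register x8. This is only an illustrative model, assuming a 4-byte int. The function name subtract_arrays_byte_offsets is hypothetical, and the comments map each statement to the instruction it corresponds to in the listing above.

```c
#include <stddef.h>

#define ARRAYSIZE 1024

/* C model of the -O1 assembly: one element per iteration, addressed by a
 * byte offset that runs from 0 up to ARRAYSIZE * sizeof(int). */
void subtract_arrays_byte_offsets(int *restrict a, const int *restrict b,
                                  const int *restrict c)
{
    size_t offset = 0;                                      /* mov x8, xzr */
    do {
        int lhs = *(const int *)((const char *)b + offset); /* ldr w9, [x1, x8] */
        int rhs = *(const int *)((const char *)c + offset); /* ldr w10, [x2, x8] */
        *(int *)((char *)a + offset) = lhs - rhs;           /* sub, then str */
        offset += sizeof(int);                              /* add x8, x8, #4 */
    } while (offset != ARRAYSIZE * sizeof(int));            /* cmp, then b.ne */
}
```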

Enable auto-vectorization

To enable auto-vectorization, increase the optimization level using the -Olevel option. The -O0 option is the lowest optimization level, while -O3 is the highest. Arm C/C++ Compiler only performs auto-vectorization at -O2 and higher:

armclang -O2 -S -o example1.s example1.c

The output assembly code is saved as example1.s. The section of the generated assembly language file containing the compiled subtract_arrays function appears as follows:

subtract_arrays:                        // @subtract_arrays
// BB#0:
        mov     x8, xzr
        add     x9, x0, #16             // =16
.LBB0_1:                                // =>This Inner Loop Header: Depth=1
        add     x10, x1, x8
        add     x11, x2, x8
        ldp     q0, q1, [x10]
        ldp     q2, q3, [x11]
        add     x10, x9, x8
        add     x8, x8, #32             // =32
        cmp     x8, #1, lsl #12         // =4096
        sub     v0.4s, v0.4s, v2.4s
        sub     v1.4s, v1.4s, v3.4s
        stp     q0, q1, [x10, #-16]
        b.ne    .LBB0_1
// BB#2:
        ret

This time, Arm C/C++ Compiler has vectorized the loop using SIMD (Single Instruction Multiple Data) instructions and registers. Notice that the LDP instruction is used to load array values into the 128-bit wide Q registers. Each vector instruction operates on four array elements at a time, and the code uses two sets of Q registers to double up and operate on eight array elements in each iteration. Consequently, each loop iteration moves through the array by 32 bytes (2 sets * 4 elements * 4 bytes) at a time.
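
For comparison, the same eight-elements-per-iteration structure can be written by hand. The sketch below uses the vld1q_s32, vsubq_s32, and vst1q_s32 NEON intrinsics from arm_neon.h when __ARM_NEON is defined, and falls back to plain C otherwise. It is not the code the compiler generates internally. The function name subtract_arrays_by_hand is hypothetical, and the element count is assumed to be a multiple of 8.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Computes a[i] = b[i] - c[i] for n elements, eight elements per iteration.
 * Assumption: n is a multiple of 8 and the arrays do not overlap. */
void subtract_arrays_by_hand(int32_t *restrict a, const int32_t *restrict b,
                             const int32_t *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
#if defined(__ARM_NEON)
        /* Two 128-bit loads per source array, four 32-bit lanes each. */
        int32x4_t b0 = vld1q_s32(b + i);
        int32x4_t b1 = vld1q_s32(b + i + 4);
        int32x4_t c0 = vld1q_s32(c + i);
        int32x4_t c1 = vld1q_s32(c + i + 4);
        /* Two vector subtractions, then two 128-bit stores. */
        vst1q_s32(a + i, vsubq_s32(b0, c0));
        vst1q_s32(a + i + 4, vsubq_s32(b1, c1));
#else
        /* Scalar fallback so the sketch also builds on non-Arm hosts. */
        for (size_t j = 0; j < 8; j++)
            a[i + j] = b[i + j] - c[i + j];
#endif
    }
}
```

In practice, letting the compiler auto-vectorize the plain loop at -O2 or higher is usually simpler than writing intrinsics, and the hand-written form is mainly useful here as a reading aid for the assembly.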

Common compiler options

See armclang --help, Arm C/C++ Compiler command-line options, and the LLVM documentation for more information about all supported options.

-S
Outputs assembly code, rather than object code. Produces a text .s file containing annotated assembly code.
-c
Performs the compilation step, but does not perform the link step. Produces an ELF object .o file. To later link object files into an executable binary, run armclang again, passing in the object files.
-o file
Specifies the name of the output file.
-march=name[+[no]feature]
Targets an architecture profile, generating generic code that runs on any processor of that architecture. For example -march=armv8-a+sve.
-mcpu=native
Enables the compiler to automatically detect the CPU it is being run on and optimize accordingly. This supports a range of Armv8-A based SoCs, including ThunderX2.
-Olevel
Specifies the level of optimization to use when compiling source files. The default is -O0.
--help
Describes the most common options supported by Arm C/C++ Compiler. Also, use man armclang to see more detailed descriptions of all the options.
--version
Displays version information.

Getting help

For a list of all the supported options, use:

armclang --help

To see detailed descriptions of all supported options, use:

man armclang

For a list of command-line options, see Arm C/C++ Compiler command-line options.

If you have problems and would like to contact our support team, get in touch here.

Resources