Getting started with Arm Instruction Emulator

This tutorial uses a series of simple examples to demonstrate how to compile SVE code, run the resulting executable and gather profiling data with Arm Instruction Emulator.


Arm Instruction Emulator is an emulator that runs on AArch64 platforms and emulates Scalable Vector Extension (SVE) instructions. The emulator lets you develop and compile SVE code with Arm Compiler for HPC, then run the SVE binary without needing access to SVE-enabled hardware.


This tutorial also uses the Arm C/C++ Compiler from Arm's suite of HPC tools.

See Installing Arm Compiler for HPC and Environment configuration for instructions on installing and configuring your Linux environment for Arm Compiler for HPC, respectively.

Note: Also ensure you have loaded the necessary modules for Arm C/C++ Compiler before beginning this tutorial.

Installing Arm Instruction Emulator

Refer to Installing Arm Instruction Emulator for details on how to perform the installation on Linux.

Environment configuration

Your administrator should have already installed Arm Instruction Emulator and made the Environment Module available.

To see which Environment Modules are available:

module avail

Note: You may need to configure the MODULEPATH environment variable to include the installation directory:

export MODULEPATH=$MODULEPATH:/opt/arm/modulefiles/

To configure your Linux environment to make Arm Instruction Emulator available:

module load <architecture>/<linux_variant>/<linux_version>/suites/arm-instruction-emulator/<version>

For example:

module load Generic-AArch64/SUSE/12/suites/arm-instruction-emulator/1.2.1

You can check your environment by examining the PATH variable. It should contain the appropriate Arm Instruction Emulator bin directory from /opt/arm, as installed in the previous section:

echo $PATH
/opt/arm/arm-instruction-emulator-1.2.1_Generic-AArch64_SUSE-12_aarch64-linux/bin:...

Simple example: Compile and run Hello World program

In this example you will write a Hello World program, compile it using Arm C/C++ Compiler, and run it using Arm Instruction Emulator.

  1. Create a simple "Hello World" C program and save it as a file. In our case, we have saved it in a file named hello.c.

    /* Hello World */

    #include <stdio.h>

    int main() {
        printf("Hello World");
        return 0;
    }

  2. To generate an executable binary, compile your program with Arm C/C++ Compiler.

    armclang -O3 -march=armv8-a+sve -o hello hello.c

    The -O3 flag enables the highest optimization level, which includes auto-vectorization. The -march=armv8-a+sve flag targets the Armv8-A architecture with the SVE extension enabled.

    Note: In this example, no SVE code is used. However, it is good practice to enable the highest level of auto-vectorization and target an SVE-enabled architecture when compiling any code to be run using Arm Instruction Emulator.

  3. Run the generated binary hello using Arm Instruction Emulator:

    armie -msve-vector-bits=256 ./hello
    Hello World

    For this simple Hello World example, Arm Instruction Emulator runs the code on an emulated SVE-enabled architecture without utilizing SVE instructions.

    To use Arm Instruction Emulator to its full potential, that is, to emulate SVE instructions, we need to look at a more complex program. An example of a program containing SVE code is available in the next section of this tutorial.

Advanced example: Compile and run a program with SVE code

In this example we demonstrate using Arm C/C++ Compiler to compile and vectorize an example with SVE code, targeting the SVE-enabled Armv8-A architecture. We then use Arm Instruction Emulator to emulate running the SVE code.

  1. Create a new file called example.c. Open the file, insert the following C code, and save and close the file.

    // example.c
    #include <stdio.h>
    #include <stdlib.h>
    #define ARRAYSIZE 1024
    int a[ARRAYSIZE];
    int b[ARRAYSIZE];
    int c[ARRAYSIZE];
    void subtract_arrays(int *restrict a, int *restrict b, int *restrict c) {
        for (int i = 0; i < ARRAYSIZE; i++)
            a[i] = b[i] - c[i];
    }
    int main() {
        for (int i = 0; i < ARRAYSIZE; i++) {
          // Generate a random number in the range 200 to 299
          b[i] = (rand() % 100) + 200;
          // Generate a random number in the range 0 to 99
          c[i] = rand() % 100;
        }
        subtract_arrays(a, b, c);
        printf("i \ta[i] \tb[i] \tc[i] \n");
        for (int i = 0; i < ARRAYSIZE; i++)
            printf("%d \t%d \t%d \t%d\n", i, a[i], b[i], c[i]);
        return 0;
    }

    This C program subtracts corresponding elements of two arrays and writes the result to a third array. The pointer parameters of subtract_arrays are declared with the restrict keyword, which tells the compiler that the arrays they point to do not overlap in memory.

  2. Compile the program as follows:

    armclang -O3 -march=armv8-a+sve -o example example.c
  3. Run the binary using Arm Instruction Emulator:

    armie -msve-vector-bits=256 ./example

    To return:

    i       a[i]    b[i]    c[i]
    0       197     283     86
    1       262     277     15
    2       258     293     35
    ...
    1021    165     234     69
    1022    232     295     63
    1023    204     235     31

    The SVE architecture extension specifies an implementation-defined vector length. The -msve-vector-bits option lets you specify the vector length used by Arm Instruction Emulator. The vector length is a multiple of 128 bits, with a maximum of 2048 bits. Use the -mlist-vector-lengths option to list all valid vector lengths:

    armie -mlist-vector-lengths

    To return:

    128 256 384 512 640 768 896 1024 1152 1280 1408 1536 1664 1792 1920 2048

Advanced example: Gathering profiling data with Arm Instruction Emulator

Arm Instruction Emulator helps you understand which parts of your code affect program performance. It samples which instruction is being executed at a user-specified frequency while the program is running.

This feature is available in Arm Instruction Emulator version 1.1 and later.

  1. This example uses the LULESH 2.0 simulation. First download and build LULESH 2.0 by following these steps:

    1. Download the latest release version of the LULESH 2.0 CPU Models package. At the time of writing, the latest version is 2.0.3.

    2. Uncompress and extract the downloaded package:

      tar -xvf lulesh2.0.3.tgz
    3. By default, the LULESH build configuration compiles using g++. We'll change this to use Arm C/C++ Compiler for HPC and generate insights by making the following changes in the Makefile:

      Change from:  SERCXX = g++ -DUSE_MPI=0
      Change to:    SERCXX = armclang++ -DUSE_MPI=0

      Change from:  CXX = $(MPICXX)
      Change to:    CXX = $(SERCXX)

      Change from:  CXXFLAGS = -g -O3 -fopenmp -I. -Wall
      Change to:    CXXFLAGS = -g -O3 -fopenmp -I. -Wall -march=armv8-a+sve -insight
    4. Build the LULESH application by running make in the directory that contains the modified Makefile.

      The build produces an executable binary, lulesh2.0 in the current directory.

  2. Run the LULESH 2.0 binary with armie, using the --profile-period or -p option to specify the sample period in microseconds:

    armie -msve-vector-bits=512 -p 100 -- ./lulesh2.0 -s 9

    This runs LULESH 2.0, sampling the program counter every 100 microseconds. When the program terminates, a samples file is created in the current directory with the name format <binary name>_<PID>.samples, for example lulesh2.0_3076.samples. Each line of this file contains an instruction address followed by the number of samples recorded at that address, for example:

    head lulesh2.0_3076.samples

    To return:

    0x402578 62
    0x402580 51 
    0x406e60 22 
    0x406e5c 20 
    0x406e58 14 
    0x402570 14 
    0x406e64 12 
    0x406004 10 
    0x406214 10 
    0x406630 9 

  3. This format enables you to use GNU/Linux tools such as addr2line, which maps instruction addresses to source locations, to understand program behavior.

    Using the addr2func helper script, which comes as part of the release, you can use the samples file to identify which functions were the hottest in the LULESH 2.0 run:

    addr2func lulesh2.0 lulesh2.0_3076.samples

    To return:

    CalcElemNodeNormals: 143
    SumElemFaceNormal: 2
    ApplyMaterialPropertiesForElems: 12
    CalcEnergyForElems: 115
    CalcPressureForElems: 311
    CalcMonotonicQGradientsForElems: 3
    CalcElemCharacteristicLength: 1
    SQRT: 1
    CalcHourglassControlForElems: 1
    CalcForceForNodes: 1
    IntegrateStressForElems: 1
    CollectDomainNodesToElemNodes: 3
    Domain::xd: 1
    CalcElemFBHourglassForce: 1
    UpdateVolumesForElems: 11
    InitStressTermsForElems: 8
    CalcFBHourglassForceForElems: 3
    CalcMonotonicQRegionForElems: 2
    ApplyAccelerationBoundaryConditionsForNodes: 2
    FABS: 14
    EvalEOSForElems: 103

    The hottest function in the list is CalcPressureForElems, which accounts for 311 samples.

    Note: The accuracy of the sampling profiler, and therefore of the performance measurement, depends on the run time of the program. The longer the run, the more accurate the numbers describing hot code.


In the event of a program crash, the operating system kernel creates a core dump file. The location and name of this core dump file depend on your system's core dump configuration. If your configuration specifies that core dump filenames include the name of the crashed binary, note that this is the name of the executable being emulated rather than the Arm Instruction Emulator binary name armie.

Core dump files should be sent to Arm support along with the output of armie --version. However, if you have confidentiality concerns regarding sensitive data in the core dump file, do not send the core dump to Arm. Note that this may mean Arm cannot investigate your issue.

If you encounter problems running a binary with Arm Instruction Emulator, use the --debug option to run internal checks (assert calls) during execution. If Arm Instruction Emulator finds an internal inconsistency, it stops executing and writes a message to stderr, which you should send to Arm support. For example:

armie -msve-vector-bits=256 --debug ./example

To output:

example: ./src/sve_decode.h:93: aarch64_i_rsp_reg::aarch64_i_rsp_reg(unsigned int, 
aarch64_i_rsp_reg::element_type): Assertion `reg_id < 32' failed. 

Alternatively, to print output messages to an output file, include -o or --output in the command line input.

The --debug option also helps you identify the instructions that were executed by the emulator. The first column is the address of the instruction, the second is the instruction encoding and the third is the number of times the instruction was executed, for example:

0x400684: 0x043f57df 1
0x4006a0: 0x04bf5028 1
0x4006c8: 0x2538c000 1
0x4006cc: 0x25291fe0 1
0x4006d4: 0xe4084140 13
0x4006d8: 0x04285028 13
0x4006dc: 0x25291d00 13
0x4006ec: 0x25a91fe0 1
0x4006f4: 0xe58103a0 1
0x4006f8: 0x04a14500 13
0x4006fc: 0xe5484140 13
0x400700: 0x04b0e3e8 13
0x400704: 0x25a91d00 13
0x400740: 0x858103a1 1
0x40074c: 0x25b8c020 1
0x400758: 0x2598e3e0 1
0x40075c: 0xa5484521 13
0x400760: 0x04938001 13
0x400764: 0xe5484541 13
0x400768: 0x04b0e3e8 13
0x40076c: 0x25ab1d01 13
0x4007bc: 0x043f505f 1

For more information about getting help, see Contacting Arm Support.