Analyze SVE programs with Arm Instruction Emulator

Running an SVE program as described in Get started with Arm Instruction Emulator verifies that the code you have developed can run on SVE hardware. However, if you are developing high-performance programs, runtime analysis is required to gain insights into their execution behavior. Runtime analysis enables you to identify heavily-used loops and instruction sequences so that improvements can be made to execution speed and memory access.

ArmIE uses DynamoRIO to emulate and instrument SVE binaries on AArch64 hardware. DynamoRIO is a publicly available dynamic binary instrumentation (DBI) tool platform which supports x86 and Arm binaries. It provides an API for writing your own binary-level runtime instrumentation, and it also supplies some example instrumentation. Each ArmIE release integrates a stable version of DynamoRIO which you can download as one seamless package.

ArmIE provides a set of instrumentation clients which can be used to analyze SVE binaries at runtime. The term 'instrumentation client' in this context refers to the way ArmIE uses DynamoRIO to work as an analysis tool as well as an emulator.

For example, one ArmIE instrumentation feature is called Regions-of-Interest (ROI). Sometimes, when analyzing large, complex, and long running programs, it is necessary to limit the amount of runtime data collected (such as memory traces, instruction, and opcode counts) to specific parts of code. You can use the ROI feature to collect runtime data for regions of the code marked with ROI markers. To add ROI markers and build the application, you must have access to the source code under analysis. To mark a ROI, use start and stop macros in the source.

Note: There are restrictions to the use of ROI markers in source code. ROIs must not be nested and they must not overlap. Violating these restrictions will result in undefined behaviour.
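
The restrictions in the note amount to requiring that start and stop markers strictly alternate, beginning with a start. As an illustration only, the following Python sketch checks that property for a sequence of marker events listed in execution order. The function name roi_markers_well_formed and the event labels "start" and "stop" are hypothetical and are not part of ArmIE; rejecting a region that is never closed is an additional assumption of this sketch.

```python
def roi_markers_well_formed(events):
    """Return True if start/stop markers alternate without nesting or overlap.

    `events` is a sequence of "start"/"stop" strings in execution order.
    These labels are hypothetical stand-ins for the ROI start and stop macros.
    """
    inside_roi = False
    for event in events:
        if event == "start":
            if inside_roi:
                # A second start before a stop means nested or overlapping ROIs.
                return False
            inside_roi = True
        elif event == "stop":
            if not inside_roi:
                # A stop with no open region has nothing to close.
                return False
            inside_roi = False
        else:
            raise ValueError(f"unknown marker event: {event!r}")
    # Assumption of this sketch: a region left open at the end is also rejected.
    return not inside_roi
```

For example, two regions placed one after the other pass the check, while a start marker inside an already open region does not.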

To emulate and analyze an SVE binary, invoke ArmIE with an instrumentation client and the SVE binary. The client is a shared object file which uses the DynamoRIO API to capture and process desired runtime events.

Procedure

  1. Invoke ArmIE with an instrumentation client and the binary:

    armie -msve-vector-bits=<arg> -i <instrumentation_client> -- ./<binary>
  2. Analyze the results provided by the instrumentation client.
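
When the same invocation is repeated for several vector lengths or binaries, it can help to assemble the command line programmatically. The following Python sketch does this using only the options that appear in this guide (-msve-vector-bits, -i, -a, and the -- separator). The function names build_armie_command and run_armie are hypothetical, and this guide does not state which output stream a client writes to, so the sketch returns both.

```python
import subprocess


def build_armie_command(vector_bits, client, binary, client_args=None):
    """Assemble the armie command line described in the procedure above."""
    command = ["armie", f"-msve-vector-bits={vector_bits}", "-i", client]
    if client_args:
        # -a passes an argument string through to the instrumentation client.
        command += ["-a", client_args]
    # The -- separator precedes the binary under analysis.
    command += ["--", binary]
    return command


def run_armie(vector_bits, client, binary, client_args=None):
    """Run armie and return everything it printed (stdout followed by stderr).

    Assumption: armie is on PATH and the client library can be located by it.
    """
    result = subprocess.run(
        build_armie_command(vector_bits, client, binary, client_args),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout + result.stderr
```

Calling build_armie_command(128, "libinscount_emulated.so", "./example") yields the argument list for the command used in the first example below.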

Example - Count AArch64 and emulated SVE instructions

This example illustrates the use of a very basic instrumentation client, which counts native AArch64 and emulated SVE instructions.

  1. Invoke ArmIE with an instrumentation client named libinscount_emulated.so and run the example binary:

    $ armie -msve-vector-bits=128 -i libinscount_emulated.so -- ./example
    This returns:

    Client inscount is running
    SVE: 0x00000000004006c8 0x25a91fe0
    SVE: 0x00000000004006d0 0xa54842a0
    SVE: 0x00000000004006d4 0xa54842c1
    SVE: 0x00000000004006d8 0x04a10400
    SVE: 0x00000000004006dc 0xe54842e0
    SVE: 0x00000000004006e0 0x04b0e3e8
    SVE: 0x00000000004006e4 0x25a91d00
    SVE: 0x00000000004006d0 0xa54842a0
    SVE: 0x00000000004006d4 0xa54842c1
    SVE: 0x00000000004006d8 0x04a10400
    SVE: 0x00000000004006dc 0xe54842e0
    SVE: 0x00000000004006e0 0x04b0e3e8
    SVE: 0x00000000004006e4 0x25a91d00
    i       a[i]    b[i]    c[i]
    =============================
    0       197     283     86
    1       262     277     15
    2       258     293     35
    3       194     286     92
    . . .
    1019    243     290     47
    1020    185     261     76
    1021    165     234     69
    1022    232     295     63
    1023    204     235     31
    2134094 instructions executed of which 1537 were emulated instructions
    $

    Notice the difference in output from the example shown in Getting started with Arm Instruction Emulator (see section Compile, vectorize and run a program with SVE code) which did not use -i libinscount_emulated.so. The additional information is what the instrumentation client libinscount_emulated.so outputs as part of its analysis of the example binary as it is running:

    Client inscount is running
    SVE: 0x00000000004006c8 0x25a91fe0
    ...
    2134094 instructions executed of which 1537 were emulated instructions
  2. The above example ran with 128-bit vectors, and a single instruction count does not provide much insight on its own. To investigate the effect that vector length has on the number of SVE instructions executed (for example, to minimize that number and help reduce execution time), run the example binary with each vector length and tabulate the results:

    Vector Length (bits)    SVE Instructions
    128                     1537
    256                     769
    384                     517
    512                     385
    640                     313
    768                     259
    896                     223
    1024                    193
    1152                    175
    1280                    157
    1408                    145
    1536                    133
    1664                    121
    1792                    115
    1920                    109
    2048                    97
  3. Plot the results on a line graph:

    Figure: The effect vector length has on the number of SVE instructions executed in this example

    The graph shows us that the largest reduction in SVE instructions executed occurs between 128 and about 512 bits. This type of analysis of an application's runtime behavior can be used with other types of analysis to study the impact of vector length on performance.
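
The shape of the curve can be reproduced with a simple model. The Python sketch below assumes, without access to the example source, that the vectorized loop processes 1024 elements (the output rows run from 0 to 1023) of 32 bits each, and that the program executes one SVE setup instruction plus six SVE instructions per loop iteration. These parameters are inferred by fitting the tabulated counts, not taken from ArmIE documentation, and the name predicted_sve_instructions is hypothetical.

```python
import math

# Measured values copied from the table above: vector length in bits -> SVE instructions.
MEASURED = {
    128: 1537, 256: 769, 384: 517, 512: 385, 640: 313, 768: 259, 896: 223, 1024: 193,
    1152: 175, 1280: 157, 1408: 145, 1536: 133, 1664: 121, 1792: 115, 1920: 109, 2048: 97,
}


def predicted_sve_instructions(vector_bits, elements=1024, element_bits=32,
                               per_iteration=6, setup=1):
    """Estimate the SVE instruction count for one vector length.

    All default parameters are assumptions fitted to the measured table.
    """
    lanes = vector_bits // element_bits          # elements processed per vector
    iterations = math.ceil(elements / lanes)     # the last iteration may be partial
    return setup + per_iteration * iterations


if __name__ == "__main__":
    print("bits  measured  predicted")
    for bits, measured in MEASURED.items():
        print(f"{bits:>4}  {measured:>8}  {predicted_sve_instructions(bits):>9}")
```

Under these assumptions the model matches every row of the table. Because the iteration count is the element count divided by the number of lanes, rounded up, each doubling of the vector length roughly halves the count, which is why the absolute savings are largest at the short vector lengths.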

Example - Analyze Regions-of-Interest (ROI)

This example illustrates the use of the libinscount_emulated.so client, an instrumentation client that allows you to limit the amount of runtime data collected to specific parts of code. This is particularly useful when analyzing large, complex, or long-running programs.

The program used in this example, loops, contains two loops. This example uses the ROI feature to limit instruction counting to a single loop. The first loop is investigated first; the second loop is then investigated and the two results are compared. The initial source code for loops is:

#define N 42
int a[N], b[N], c[N];

int main(void) {

  a[0] = 0;
  b[0] = 1;
  c[0] = a[0] + b[0];

  /* First loop: fill c with its own indices. */
  for(int i=0; i<N; ++i)
    c[i] = i;

  /* Second loop: indexed load from b through c. */
  for(int i=0; i<N; ++i)
    a[i] = b[i] + b[c[i]];
}
  
  1. Build and run the example loops program with the libinscount_emulated.so client:

    $ armie -msve-vector-bits=512 -i libinscount_emulated.so -- ./loops
    Client inscount is running
    89539 instructions executed of which 36 were emulated instructions
    $

    All of the instructions executed are counted.

  2. To limit instruction counting to a specific area of code, or the region-of-interest (ROI), add ROI markers to the loops source.
    To indicate where to start counting, add the __START_TRACE() marker. To indicate where to stop counting, add the __STOP_TRACE() marker.
    For example, to wrap the first loop of the loops code in ROI markers, use:

    #define N 42
    int a[N], b[N], c[N];
    
    #define __START_TRACE() { asm volatile (".inst 0x2520e020"); }
    #define __STOP_TRACE() { asm volatile (".inst 0x2520e040"); }
    
    int main(void) {
      __START_TRACE();
    
      a[0] = 0;
      b[0] = 1;
      c[0] = a[0] + b[0];
    
      for(int i=0; i<N; ++i)
        c[i] = i;
    
      __STOP_TRACE();
    
      for(int i=0; i<N; ++i)
        a[i] = b[i] + b[c[i]];
    }
    
  3. Build the new binary and name it first_loop.
  4. Run first_loop with the libinscount_emulated.so client:

    $ armie -msve-vector-bits=512 -i libinscount_emulated.so -a -roi -- ./first_loop
    Client inscount is running
    31 instructions executed of which 16 were emulated instructions
    $

    Notice the difference from the loops run:

    • Only the first loop has been instrumented and as a result fewer executed instructions have been counted at runtime.
    • The armie command includes the -a -roi option, which tells the libinscount_emulated.so client to enable and disable instruction counting based on the __START_TRACE() and __STOP_TRACE() macros. Without the -a -roi option, the client ignores the macros and counts all instructions, producing the same output as for the loops run above:

      $ armie -msve-vector-bits=512 -i libinscount_emulated.so -- ./first_loop
      Client inscount is running
      89539 instructions executed of which 36 were emulated instructions
      $

      The -a option is a new feature introduced in ArmIE 20.0 to enable you to pass command line arguments to instrumentation clients. In this case the argument is -roi but it can be any string which the client can use to adjust its behavior at execution time. Run armie --help for a description of the -a option or see the ‘Options’ section in the ArmIE Command Reference.

  5. Next, analyze the second loop. Move the __START_TRACE() and __STOP_TRACE() markers to surround the second for loop:

    #define N 42
    int a[N], b[N], c[N];
    
    #define __START_TRACE() { asm volatile (".inst 0x2520e020"); }
    #define __STOP_TRACE() { asm volatile (".inst 0x2520e040"); }
    
    int main(void) {
    
      a[0] = 0;
      b[0] = 1;
      c[0] = a[0] + b[0];
    
      for(int i=0; i<N; ++i)
        c[i] = i;
    
      __START_TRACE();
    
      for(int i=0; i<N; ++i)
        a[i] = b[i] + b[c[i]];
    
      __STOP_TRACE();
    }
    
  6. Build the new binary and name it second_loop.
  7. Run and analyze the second_loop binary:

    $ armie -msve-vector-bits=512 -i libinscount_emulated.so -a -roi -- ./second_loop
    Client inscount is running
    31 instructions executed of which 20 were emulated instructions
    $

    In this run, more SVE instructions are executed than for the first_loop run because of the extra vector load and arithmetic instructions in the second loop.
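
To compare runs such as these without reading the numbers by hand, the summary line can be parsed from the captured output. The following Python sketch assumes the summary line has exactly the wording shown in the outputs above; the names parse_summary and emulated_share are hypothetical helpers and are not part of ArmIE.

```python
import re

# Matches the summary line printed in the example outputs above.
SUMMARY = re.compile(
    r"(\d+) instructions executed of which (\d+) were emulated instructions"
)


def parse_summary(output):
    """Return (total, emulated) instruction counts from the client's output text."""
    match = SUMMARY.search(output)
    if match is None:
        raise ValueError("no instruction-count summary found")
    return int(match.group(1)), int(match.group(2))


def emulated_share(output):
    """Return the fraction of counted instructions that were emulated."""
    total, emulated = parse_summary(output)
    return emulated / total
```

Applied to the outputs above, this gives 16 of 31 (about 52%) for first_loop and 20 of 31 (about 65%) for second_loop, compared with 36 of 89539 (about 0.04%) for the whole program.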

The source code for the instrumentation clients is in the ArmIE installation’s samples directory. You can modify these clients for your own custom analysis requirements. For examples and guidance, see the ‘Instrumentation guides’ section in Tutorials.

 

Next steps

  • Further instrumentation clients that provide different insights are available, including:

    • inscount_emulated.cpp
    • instrace_emulated.c
    • meminstrace_emulated.c
    • memtrace_emulated.c
    • opcodes_emulated.cpp

    These are ROI-capable and their source code is in the ArmIE installation samples directory:
    /path/to/your/arm-instruction-emulator-<xx.y>_Generic-AArch64_<OS>_aarch64-linux/samples/
    You can modify and enhance these clients for your specific analysis requirements. For examples and guidance on how to do this, see Building Custom Analysis Instrumentation and the Instrumentation guides section in Tutorials.
  • For more advanced analysis examples of a real-world application, see Emulating SVE on existing Armv8-A hardware using DynamoRIO and ArmIE. This includes use-case examples of libopcodes_emulated.so and libmemtrace_simple.so.

Related information