Analyzing SVE programs with Arm Instruction Emulator

Running an SVE program as described in Getting started with Arm Instruction Emulator verifies that the code you have developed will be able to run on SVE hardware when it becomes available. However, if you are developing high-performance programs, some form of runtime analysis is required to gain insights into their execution behavior. This enables you to identify heavily-used loops and instruction sequences so that improvements can be made to execution speed and memory access.

ArmIE uses DynamoRIO to emulate and instrument SVE binaries on AArch64 hardware. DynamoRIO is a publicly available dynamic binary instrumentation (DBI) tool platform that supports x86 and Arm binaries. It provides an API that enables you to write your own binary-level runtime instrumentation, and it supplies some example instrumentation. The ArmIE release is integrated with a stable version of DynamoRIO, which you can download as one seamless package.

ArmIE provides a set of instrumentation clients which can be used to analyze SVE binaries at runtime. The term 'instrumentation client' in this context refers to the way ArmIE uses DynamoRIO to work as an analysis tool as well as an emulator. ArmIE is invoked with an instrumentation client and the SVE binary to be emulated and analyzed. The client is simply a shared object file which uses the DynamoRIO API to capture and process desired runtime events.

Procedure

This example illustrates the use of a very basic instrumentation client, which counts native AArch64 and emulated SVE instructions.

  1. The following command invokes ArmIE with an instrumentation client named libinscount_emulated.so and runs the example binary under it:
    $ armie -msve-vector-bits=128 -i libinscount_emulated.so -- ./example

    This returns:
    Client inscount is running
    SVE: 0x00000000004006c8 0x25a91fe0
    SVE: 0x00000000004006d0 0xa54842a0
    SVE: 0x00000000004006d4 0xa54842c1
    SVE: 0x00000000004006d8 0x04a10400
    SVE: 0x00000000004006dc 0xe54842e0
    SVE: 0x00000000004006e0 0x04b0e3e8
    SVE: 0x00000000004006e4 0x25a91d00
    SVE: 0x00000000004006d0 0xa54842a0
    SVE: 0x00000000004006d4 0xa54842c1
    SVE: 0x00000000004006d8 0x04a10400
    SVE: 0x00000000004006dc 0xe54842e0
    SVE: 0x00000000004006e0 0x04b0e3e8
    SVE: 0x00000000004006e4 0x25a91d00
    i       a[i]    b[i]    c[i]
    =============================
    0       197     283     86
    1       262     277     15
    2       258     293     35
    3       194     286     92
    . . .
    1019    243     290     47
    1020    185     261     76
    1021    165     234     69
    1022    232     295     63
    1023    204     235     31
    2134094 instructions executed of which 1537 were emulated instructions
    $

    Notice the difference in output from the example shown in Getting started with Arm Instruction Emulator (see section Compile, vectorize and run a program with SVE code) which did not use -i libinscount_emulated.so. The additional information is what the instrumentation client libinscount_emulated.so outputs as part of its analysis of the example binary as it is running:

    Client inscount is running
    SVE: 0x00000000004006c8 0x25a91fe0
    ...
    2134094 instructions executed of which 1537 were emulated instructions
  2. The example above ran with 128-bit vectors, and a single instruction count provides little insight on its own. Suppose, however, that you want to know how vector length affects the number of SVE instructions executed for example.c, with the aim of minimizing them and reducing execution time. First, run the binary at each vector length and tabulate the results:

    Vector Length (bits)    SVE Instructions
    128                     1537
    256                     769
    384                     517
    512                     385
    640                     313
    768                     259
    896                     223
    1024                    193
    1152                    175
    1280                    157
    1408                    145
    1536                    133
    1664                    121
    1792                    115
    1920                    109
    2048                    97
  3. Then plot the results on a line graph:
    [Figure: The effect of vector length on the number of SVE instructions executed in this example]
    The graph shows us that the largest reduction in SVE instructions executed occurs between 128 and about 512 bits. This type of analysis of an application's runtime behavior can be used with other types of analysis to study the impact of vector length on performance.
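
The sweep in steps 2 and 3 can be scripted. The following Python sketch is one possible way to do it and is not part of ArmIE. It assumes that armie is on your PATH, that ./example is the binary from step 1, and that the client prints its summary line in the format shown in step 1. The function names parse_summary, run_inscount and sweep are hypothetical.

```python
import re
import subprocess

# Summary line format as printed by libinscount_emulated.so in step 1.
SUMMARY_RE = re.compile(
    r"(\d+) instructions executed of which (\d+) were emulated instructions"
)


def parse_summary(output):
    """Return (total, emulated) instruction counts found in ArmIE output."""
    match = SUMMARY_RE.search(output)
    if match is None:
        raise ValueError("no instruction-count summary found in output")
    return int(match.group(1)), int(match.group(2))


def run_inscount(vector_bits, binary="./example"):
    """Run the binary under ArmIE at one vector length (command from step 1)."""
    cmd = ["armie", f"-msve-vector-bits={vector_bits}",
           "-i", "libinscount_emulated.so", "--", binary]
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    # Assumption: the summary may appear on stdout or stderr, so search both.
    return parse_summary(result.stdout + result.stderr)


def sweep(lengths=range(128, 2049, 128)):
    """Map each vector length in bits to its emulated SVE instruction count."""
    return {bits: run_inscount(bits)[1] for bits in lengths}


# Parsing check against the summary line shown in step 1 (does not run armie).
sample = "2134094 instructions executed of which 1537 were emulated instructions"
print(parse_summary(sample))
```

On a system with ArmIE installed, calling sweep() returns a dictionary that can be printed as the table in step 2 or passed to a plotting tool for step 3.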

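The tabulated counts follow a simple pattern that explains the shape of the curve. The Python sketch below is an interpretation of the data, not something ArmIE reports. It assumes 1024 array elements (the program prints i from 0 to 1023), 32-bit elements, one SVE instruction executed before the loop, and six SVE instructions per loop iteration (the trace in step 1 shows one address followed by six repeating addresses). Under those assumptions the model reproduces every count in the table.

```python
# Measured SVE instruction counts, copied from the table in step 2.
measured = {
    128: 1537, 256: 769, 384: 517, 512: 385, 640: 313, 768: 259,
    896: 223, 1024: 193, 1152: 175, 1280: 157, 1408: 145, 1536: 133,
    1664: 121, 1792: 115, 1920: 109, 2048: 97,
}

ELEMENTS = 1024     # array length (i runs from 0 to 1023)
ELEMENT_BITS = 32   # assumed element width
LOOP_INSNS = 6      # assumed SVE instructions per loop iteration
SETUP_INSNS = 1     # assumed SVE instructions executed before the loop


def predicted(vector_bits):
    """Predict the SVE instruction count for one vector length."""
    lanes = vector_bits // ELEMENT_BITS
    iterations = -(-ELEMENTS // lanes)  # ceiling division
    return SETUP_INSNS + LOOP_INSNS * iterations


# Vector lengths where the model disagrees with the measurement.
mismatches = {vl: n for vl, n in measured.items() if predicted(vl) != n}
print("mismatches:", mismatches)

# Share of the total 128-to-2048-bit reduction already reached at 512 bits.
share = (measured[128] - measured[512]) / (measured[128] - measured[2048])
print(f"share of reduction reached by 512 bits: {share:.0%}")
```

The model has no mismatches against the table, and 80% of the total reduction is already reached at 512 bits. Because the iteration count is the ceiling of the element count divided by the lanes per vector, going from 128 to 256 bits halves the number of loop iterations, whereas going from 1920 to 2048 bits removes only two.
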
Next steps

  • Further instrumentation clients are available that provide different insights, including opcode counting (libopcodes_emulated.so) and memory tracing (libmemtrace_simple.so).
  • For more advanced analysis examples of a real-world application, see Emulating SVE on existing Armv8-A hardware using DynamoRIO and ArmIE. This includes use-case examples of libopcodes_emulated.so and libmemtrace_simple.so.
  • Source code for the instrumentation clients is provided with the Arm Instruction Emulator build, in the samples directory:
    /path/to/your/arm-instruction-emulator-<xx.y>_Generic-AArch64_<OS>_aarch64-linux/samples/

    The files in this directory are inscount_emulated.cpp, opcodes_emulated.cpp and memtrace_simple.c. You can modify and enhance these for your specific analysis. See Building Custom Analysis Instrumentation for instructions on how to do this.

Related information