6 Interpreting performance reports

This section takes you through interpreting the reports produced by Arm Performance Reports.

Reports are generated in both HTML and textual formats for each run of your application by default. The same information is presented in both.

If you wish to combine Arm Performance Reports with other tools, consider using the CSV output format.

See 6.12 for more details.

6.1 HTML performance reports

Viewing HTML files is best done on your local machine. Many sites have places you can put HTML files to be viewed from within the intranet. These directories are a good place to automatically send your performance reports to. Alternatively, you can use scp or even the excellent sshfs to make the reports available to your laptop or desktop:


    $ scp login1:arm/reports/examples/wave_c_4p*.html .
    $ firefox wave_c_4p*.html

The following report was generated by running the wave_openmp.c example program with 8 MPI processes and 2 OpenMP threads per process on a typical HPC cluster:

Figure 4: A performance report for the wave_openmp.c example

Your report may differ from this one depending on the performance and network architecture of the machine it is run on, but the basic structure of these reports is always the same. This makes comparisons between reports simple, direct and intuitive. Each section of the report is described in the following sections.

6.2 Report summary

This section characterizes how the application's wall-clock time was spent, broken down into compute, MPI and I/O.

In this example file you see that Arm Performance Reports has identified the program as being compute-bound, which simply means that most of its time is spent inside application code rather than communicating or using the filesystem.

The snippets of advice, such as "this code may benefit from running at larger scales" are good starting points for guiding future investigations and are designed to be meaningful to scientific users with no previous MPI tuning experience.

The triangular radar chart in the top-right corner of the report reflects the values of these three key measurements: compute, MPI and I/O. It is helpful to recognize and compare these triangular shapes when switching between multiple reports.

6.2.1 Compute

Time spent computing. This is the percentage of wall-clock time spent in application and in library code, excluding time spent in MPI calls and I/O calls.

6.2.2 MPI

Time spent communicating. This is the percentage of wall-clock time spent in MPI calls such as MPI_Send, MPI_Reduce and MPI_Barrier.

6.2.3 Input/Output

Time spent reading from and writing to the filesystem. This is the percentage of wall-clock time spent in system library calls such as read, write and close.

Note

All time spent in MPI-IO calls is included here, even though some communication between processes may also be performed by the MPI library. MPI_File_close is treated as time spent writing, which is often but not always correct.

6.3 CPU breakdown

Note

All of the metrics described in this section are only available on x86_64 systems.

This section breaks down the time spent in application and library code further by analyzing the kinds of instructions that this time was spent on.

Note that all percentages here are relative to the compute time, not to the entire application run. Time spent in MPI and I/O calls is not represented inside this section.

6.3.1 Single core code

The percentage of wall-clock time in which the application executed using only one core per process, as opposed to multithreaded or OpenMP code. If you have a multithreaded or OpenMP application, a high value here indicates that your application is bound by Amdahl's law and that scaling to larger numbers of threads will not meaningfully improve performance. For example, if 30% of the compute time is single-core, the speedup achievable by adding threads can never exceed 1/0.3 ≈ 3.3x, however many threads are used.

6.3.2 OpenMP code

The percentage of wall-clock time spent in OpenMP regions. The higher this is, the better. This metric is only shown if the program spent a measurable amount of time inside at least one OpenMP region.

6.3.3 Scalar numeric ops

The percentage of time spent executing arithmetic operations such as add, mul, div. This does not include time spent using the more efficient vectorized versions of these operations.

6.3.4 Vector numeric ops

The percentage of time spent executing vectorized arithmetic operations, such as those provided by Intel's SSE2 and AVX instruction set extensions.

Generally it is good if a scientific code spends most of its time in these operations, as that is the only way to achieve anything close to the peak performance of modern processors.

If this value is low it is worth checking the compiler's vectorization report to understand why the most time consuming loops are not using these operations. Compilers need a good deal of help to efficiently vectorize non-trivial loops and the investment in time is often rewarded with 2x-4x performance improvements.
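
A minimal sketch of one common case, assuming a hypothetical scale_add kernel: the compiler may refuse to vectorize the loop because it cannot prove that the input and output arrays do not overlap, and the C99 restrict qualifier removes that doubt. The GCC flag mentioned in the comment is one example of a vectorization report option; other compilers use different flags.

    /* Without "restrict" the compiler must assume x and y may overlap
     * and will often generate scalar code; with it, the loop is a
     * straightforward candidate for SSE/AVX vectorization.
     * A report of missed vectorization can be requested with, for
     * example, "gcc -O3 -fopt-info-vec-missed". */
    void scale_add(double *restrict y, const double *restrict x,
                   double a, int n)
    {
        for (int i = 0; i < n; i++)
            y[i] = y[i] + a * x[i];
    }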

6.3.5 Memory accesses

The percentage of time spent in memory access operations, such as mov, load, store. A portion of the time spent in instructions using indirect addressing is also included here. A high figure here shows the application is memory-bound and is not able to make full use of the CPU resources. Often it is possible to reduce this figure by analyzing loops for poor cache performance and problematic memory access patterns, boosting performance significantly.

A high percentage of time spent in memory accesses in an OpenMP program is often a scalability problem. If each core spends most of its time waiting for memory, even the L3 cache, then adding further cores rarely improves matters. Equally, false sharing, in which cores repeatedly invalidate each other's copies of the same cache line, and over-use of the atomic pragma both show up as increased time spent in memory accesses.
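
As an illustration (not taken from the example programs), the sketch below shows one way false sharing arises and the usual fix; the function names and the fixed-size partial array are assumptions made for brevity.

    #include <omp.h>

    /* False sharing: each thread updates only its own slot of "partial",
     * but neighbouring doubles share a cache line, so every update
     * invalidates that line on the other cores and appears as time
     * spent in memory accesses. */
    double sum_false_sharing(const double *x, int n, int nthreads)
    {
        double partial[64] = { 0 };          /* assumes nthreads <= 64 */
        #pragma omp parallel num_threads(nthreads)
        {
            int t = omp_get_thread_num();
            #pragma omp for
            for (int i = 0; i < n; i++)
                partial[t] += x[i];          /* repeated writes to a shared line */
        }
        double sum = 0;
        for (int t = 0; t < nthreads; t++)
            sum += partial[t];
        return sum;
    }

    /* The idiomatic fix: accumulate in a private variable via an OpenMP
     * reduction and touch shared memory only once per thread. */
    double sum_reduction(const double *x, int n)
    {
        double sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += x[i];
        return sum;
    }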

6.3.6 Waiting for accelerators

The percentage of time that the CPU is waiting for the accelerator.

6.4 OpenMP breakdown

This section breaks down the time spent in OpenMP regions into computation and synchronization and includes additional metrics that help to diagnose OpenMP performance problems. It is only shown if a measurable amount of time was spent inside OpenMP regions.

6.4.1 Computation

The percentage of time threads in OpenMP regions spent computing as opposed to waiting or sleeping. Keeping this high is one important way to ensure OpenMP codes scale well. If this is high then look at the CPU breakdown to see whether that time is being well spent on, for example, floating-point operations, or whether the cores are mostly waiting for memory accesses.

6.4.2 Synchronization

The percentage of time threads in OpenMP regions spent waiting or sleeping. By default, each OpenMP region ends with an implicit barrier; if the workload is imbalanced and some threads finish sooner than others, they wait at that barrier and this value increases. Equally, there is some overhead associated with entering and leaving OpenMP regions, so a high synchronization time may suggest that the threading is too fine-grained. In general, OpenMP performance is better when outer loops are parallelized rather than inner loops.
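
As a rough illustration, the two hypothetical routines below perform the same update. The first enters a parallel region, and pays for its implicit barrier, once per row; the second parallelizes the outer loop and synchronizes only once.

    /* Fine-grained: one parallel region and one implicit barrier per row,
     * so synchronization cost grows with the number of rows. */
    void update_inner(double **a, double **b, int rows, int cols)
    {
        for (int i = 0; i < rows; i++) {
            #pragma omp parallel for
            for (int j = 0; j < cols; j++)
                a[i][j] = 0.5 * (b[i][j] + b[i][(j + 1) % cols]);
        }
    }

    /* Coarse-grained: a single parallel region and barrier for the whole
     * nest, which usually scales far better. */
    void update_outer(double **a, double **b, int rows, int cols)
    {
        #pragma omp parallel for
        for (int i = 0; i < rows; i++)
            for (int j = 0; j < cols; j++)
                a[i][j] = 0.5 * (b[i][j] + b[i][(j + 1) % cols]);
    }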

6.4.3 Physical core utilization

Modern CPUs often have multiple logical cores for each physical core. This is often referred to as hyper-threading. These logical cores may share logic and arithmetic units. Some programs perform better when using additional logical cores, but most HPC codes do not.

If the value here is greater than 100%, then OMP_NUM_THREADS is set to more threads than there are physical cores available and performance may be impacted, usually showing up as a larger percentage of time in OpenMP synchronization or memory accesses.

6.4.4 System load

The number of active (running or runnable) threads as a percentage of the number of physical CPU cores present in the compute node. This value may exceed 100% if you are using hyper-threading, if the cores are oversubscribed, or if other system processes and daemons start running and take CPU resources away from your program. A value consistently less than 100% may indicate your program is not taking full advantage of the CPU resources available on a compute node.

6.5 Threads breakdown

This section breaks down the time spent by worker threads (non-main threads) into computation and synchronization and includes additional metrics that help to diagnose multicore performance problems. This section is replaced by the OpenMP Breakdown if a measurable amount of application time was spent in OpenMP regions.

6.5.1 Computation

The percentage of time worker threads spent computing as opposed to waiting in locks and synchronization primitives. If this is high then look at the CPU breakdown to see whether that time is being well spent on, for example, floating-point operations, or whether the cores are mostly waiting for memory accesses.

6.5.2 Synchronization

The percentage of time worker threads spend waiting in locks and synchronization primitives. This only includes time in which those threads were active on a core and does not include time spent sleeping while other useful work is being done. A large value here indicates a performance and scalability problem that should be tracked down with a multicore profiler such as Arm MAP.
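
For illustration only, the two hypothetical worker functions below process the same range of elements; the first takes a shared lock once per element and would show up as a high synchronization percentage, while the second accumulates privately and takes the lock once per thread.

    #include <pthread.h>

    typedef struct {
        const double    *x;             /* input data */
        long             begin, end;    /* half-open range for this thread */
        double          *total;         /* shared accumulator */
        pthread_mutex_t *lock;          /* protects *total */
    } task_t;

    /* Heavily contended: the mutex is taken once per element. */
    static void *sum_contended(void *arg)
    {
        task_t *t = arg;
        for (long i = t->begin; i < t->end; i++) {
            pthread_mutex_lock(t->lock);
            *t->total += t->x[i];
            pthread_mutex_unlock(t->lock);
        }
        return NULL;
    }

    /* Low contention: accumulate privately, take the mutex once per thread. */
    static void *sum_batched(void *arg)
    {
        task_t *t = arg;
        double local = 0;
        for (long i = t->begin; i < t->end; i++)
            local += t->x[i];
        pthread_mutex_lock(t->lock);
        *t->total += local;
        pthread_mutex_unlock(t->lock);
        return NULL;
    }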

6.5.3 Physical core utilization

Modern CPUs often have multiple logical cores for each physical core. This is often referred to as hyper-threading. These logical cores may share logic and arithmetic units. Some programs perform better when using additional logical cores, but most HPC codes do not.

The value here shows the percentage utilization of physical cores. A value over 100% indicates that more threads are executing than there are physical cores, indicating that hyper-threading is in use.

A program may have dozens of helper threads that do little except sleep and these will not be shown here. Only threads actively and simultaneously consuming CPU time are included in this metric.

6.5.4 System load

The number of active (running or runnable) threads as a percentage of the number of physical CPU cores present in the compute node. This value may exceed 100% if you are using hyper-threading, if the cores are oversubscribed, or if other system processes and daemons start running and take CPU resources away from your program. A value consistently less than 100% may indicate your program is not taking full advantage of the CPU resources available on a compute node.

6.6 MPI breakdown

This section breaks down the time spent in MPI calls reported in the summary. It is only of interest if the program is spending a significant amount of its time in MPI calls in the first place.

All the rates quoted here are combined inbound and outbound rates: they measure the rate of communication from the process to the MPI API, not the performance of the underlying network hardware.

This application-level perspective is used throughout Arm Performance Reports and, in this case, allows the results to capture effects such as faster intra-node performance and zero-copy transfers.

Note that for programs that make MPI calls from multiple threads (that is, MPI is running in MPI_THREAD_SERIALIZED or MPI_THREAD_MULTIPLE mode), Arm Performance Reports will only display metrics for MPI calls made on the main thread.

6.6.1 Time in collective calls

The percentage of time spent in collective MPI operations such as MPI_Scatter, MPI_Reduce and MPI_Barrier.

6.6.2 Time in point-to-point calls

The percentage of time spent in point-to-point MPI operations such as MPI_Send and MPI_Recv.

6.6.3 Effective process collective rate

The average per-process transfer rate during collective operations, from the perspective of the application code and not the transfer layer. For example, an MPI_Alltoall that takes 1 second to send 10 Mb to 50 processes and receive 10 Mb from 50 processes has an effective transfer rate of 10x50x2 = 1000 Mb/s.

Collective rates can often be higher than the peak point-to-point rate if the network topology matches the application's communication patterns well.

6.6.4 Effective process point-to-point rate

The average per-process transfer rate during point-to-point operations, from the perspective of the application code and not the transfer layer. Asynchronous calls that allow the application to overlap communication and computation, such as MPI_Isend, are able to achieve much higher effective transfer rates than synchronous calls.

Overlapping communication and computation is often a good strategy to improve application performance and scalability.
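
A minimal sketch of this pattern, assuming a halo-exchange style code: compute_interior and compute_halo stand in for application routines and are not part of the example programs.

    #include <mpi.h>

    void compute_interior(void);   /* hypothetical: work that needs no halo data */
    void compute_halo(void);       /* hypothetical: work that needs the received halo */

    void exchange_and_compute(double *halo_out, double *halo_in, int count,
                              int left, int right, MPI_Comm comm)
    {
        MPI_Request req[2];

        /* Post the nonblocking receive and send, then keep computing
         * while the messages are in flight. */
        MPI_Irecv(halo_in,  count, MPI_DOUBLE, left,  0, comm, &req[0]);
        MPI_Isend(halo_out, count, MPI_DOUBLE, right, 0, comm, &req[1]);

        compute_interior();

        /* Only wait once there is nothing left to overlap with. */
        MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

        compute_halo();
    }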

6.7 I/O breakdown

This section breaks down the amount of time spent in library and system calls relating to I/O, such as read, write and close. I/O due to MPI network traffic is not included. In most cases this should be a direct measure of the amount of time spent reading and writing to the filesystem, whether local or networked.

Some systems, such as the Cray X-series, do not have I/O accounting enabled for all filesystems. On these systems only Lustre I/O is reported in this section.

6.7.1 Time in reads

The percentage of time spent on average in read operations from the application's perspective, not the filesystem's perspective. Time spent in the stat system call is also included here.

6.7.2 Time in writes

The percentage of time spent on average in write and sync operations from the application's perspective, not the filesystem's perspective.

Opening and closing files is also included here, as measurements have shown that the latest networked filesystems can spend significant amounts of time opening files with create or write permissions.

6.7.3 Effective process read rate

The average transfer rate during read operations from the application's perspective. A cached read will have a much higher read rate than one that has to hit a physical disk. This is particularly important to optimize for as current clusters often have complex storage hierarchies with multiple levels of caching.

6.7.4 Effective process write rate

The average transfer rate during write and sync operations from the application's perspective. A buffered write will have a much higher write rate than one that has to hit a physical disk, but unless there is significant time between writing and closing the file, the penalty will be paid during the synchronous close operation instead. All these complexities are captured in this measurement.
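
For illustration, the hypothetical checkpoint routine below shows where these costs tend to surface: the write call usually completes into the operating system's cache, and the transfer to the disk or networked filesystem is paid for in fsync or in the final close.

    #include <fcntl.h>
    #include <unistd.h>

    int checkpoint(const char *path, const void *buf, size_t len)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;

        /* Usually returns quickly: the data lands in the page cache. */
        if (write(fd, buf, len) != (ssize_t)len) {
            close(fd);
            return -1;
        }

        /* The real cost of reaching storage is often paid here, or in
         * the close() below if no explicit sync is performed. */
        if (fsync(fd) != 0) {
            close(fd);
            return -1;
        }
        return close(fd);
    }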

6.8 Memory breakdown

Unlike the other sections, the memory section does not refer to one particular portion of the job. Instead, it summarizes memory usage across all processes and nodes over the entire duration. All of these metrics refer to RSS, that is, physical RAM usage, not virtual memory usage. Most HPC jobs attempt to stay within the physical RAM of their node for performance reasons.

6.8.1 Mean process memory usage

The average amount of memory used per-process across the entire length of the job.

6.8.2 Peak process memory usage

The peak memory usage seen by one process at any moment during the job. If this varies greatly from the mean process memory usage then it may be a sign of either imbalanced workloads between processes or a memory leak within a process.

Note

This is not a true high-watermark, but rather the peak memory seen during statistical sampling. For most scientific codes this is not a meaningful difference as rapid allocation and deallocation of large amounts of memory is universally avoided for performance reasons.

6.8.3 Peak node memory usage

The peak percentage of memory seen used on any single node during the entire run. If this is close to 100% then swapping may be occurring, or the job may be likely to hit hard system-imposed limits. If this is low then it may be more efficient in CPU hours to run with a smaller number of nodes and a larger workload per node.

6.9 Accelerator breakdown

Figure 5: Accelerator metrics report

This section shows the utilization of NVIDIA CUDA accelerators by the job.

6.9.1 GPU utilization

The average percentage of the GPU cards working when at least one CUDA kernel is running.

6.9.2 Global memory accesses

The average percentage of time that the GPU cards were reading or writing to global (device) memory.

6.9.3 Mean GPU memory usage

The average amount of memory in use on the GPU cards.

6.9.4 Peak GPU memory usage

The maximum amount of memory in use on the GPU cards.

6.10 Energy breakdown

Figure 6: Energy metrics report

This section shows the energy used by the job, broken down by component, for example CPU and accelerators.

6.10.1 CPU

The percentage of the total energy used by the CPUs.

CPU power measurement requires an Intel CPU with RAPL support, for example Sandy Bridge or newer, and the intel_rapl powercap kernel module to be loaded.

6.10.2 Accelerator

The percentage of energy used by the accelerators. This metric is only shown when a CUDA card is present.

6.10.3 System

The percentage of energy used by other components not shown above. If CPU and accelerator metrics are not available the system energy will be 100%.

6.10.4 Mean node power

The average of the mean power consumption of all the nodes in Watts.

6.10.5 Peak node power

The highest peak power consumption measured on any node, in Watts.

6.10.6 Requirements

CPU power measurement requires an Intel CPU with RAPL support, for example Sandy Bridge or newer, and the intel_rapl powercap kernel module to be loaded.

Node power monitoring is implemented via one of two methods: the Arm IPMI energy agent which can read IPMI power sensors, or the Cray HSS energy counters.

For more information on how to install the Arm IPMI energy agent please see G.4 Arm IPMI Energy Agent. The Cray HSS energy counters are known to be available on Cray XK6 and XC30 machines.

Accelerator power measurement requires an NVIDIA GPU that supports power monitoring. This can be checked on the command line with nvidia-smi -q -d power. If the power values are reported as "N/A", power monitoring is not supported.

6.11 Textual performance reports

The same information is presented as in 6.1 HTML performance reports, but in a format better suited to automatic data extraction and reading from a terminal:


    Command:        mpiexec -n 16 examples/wave_c 60
    Resources:      1 node (12 physical, 24 logical cores per node, 2 GPUs per node available)
    Memory:         15 GB per node, 11 GB per GPU
    Tasks:          16 processes
    Machine:        node042
    Started on:     Tue Feb 25 12:14:06 2014
    Total time:     60 seconds (1 minute)
    Full path:      /global/users/mark/arm/reports/examples
    Notes:

    Summary: wave_c is compute-bound in this configuration
    Compute:    82.4%  |=======|
    MPI:        17.6%  |=|
    I/O:         0.0%  |
    This application run was compute-bound. A breakdown of this time and advice for investigating further is found in the compute section below.
    As little time is spent in MPI calls, this code may also benefit from running at larger scales.
    ...

A combination of grep and sed can be useful for extracting and comparing values between multiple runs, or for automatically placing such data into a centralized database.

6.12 CSV performance reports

A CSV (comma-separated values) output file can be generated using the --output argument and specifying a filename with the .csv extension:


    perf-report --output=myFile.csv ...

The CSV file will contain lines in a NAME, VALUE format for each of the reported fields. This is convenient for passing to an automated analysis tool, such as a plotting program. It can also be imported into a spreadsheet for comparing values across runs.

6.13 Worked examples

The best way to understand how to use and interpret performance reports is by example. You can download several sets of real-world reports with analysis and commentary from the Arm Developer website.

At the time of writing there are three collections available, which are described in the following sections.

6.13.1 Code characterization and run size comparison

A set of runs from well-known HPC codes at different scales showing different problems:

Characterization of HPC codes and problems

6.13.2 Deeper CPU metric analysis

A look at the impact of hyper-threading on the performance of a code as seen through the CPU instructions breakdown:

Exploring hyperthreading

6.13.3 I/O performance bottlenecks

The open source MAD-bench I/O benchmark is run in several different configurations, including on a laptop, and the performance implications analyzed:

Understanding I/O behavior
