Characterizing HPC codes with Arm Performance Reports

In the following examples, we’re going to look at four different runs of CP2K and HPL at 256 processes, both with and without problems. Each case has its own unique pattern visible with Arm Performance Reports.

Let's study them in turn to understand how these reports can be used to characterize applications and diagnose abnormal behavior.

CP2K molecular dynamics simulation code

The proportion of time spent waiting for memory accesses is high, so improving this is one optimization target. A source-level profiler such as Arm MAP can help. The 50 Mb/s point-to-point MPI communication rate suggests either lots of small, inefficient messages or significant load imbalance. This run used only 7% of the peak node memory.

The easiest way to improve efficiency and use fewer CPU hours is to run such jobs at a lower scale, which will also reduce the amount of time spent in MPI communications. An 8x reduction in MPI processes down to just 32 would be a good start.
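The trade-off behind this advice is simple arithmetic. The sketch below is illustrative only: the wall-clock times are hypothetical placeholders, not measurements from this run, and one core per MPI process is assumed. It shows that dropping from 256 to 32 processes saves CPU hours whenever the job takes less than 8x as long to finish.

```python
def cpu_hours(processes: int, wall_hours: float) -> float:
    """CPU hours consumed by a job, assuming one core per MPI process."""
    return processes * wall_hours


def break_even_slowdown(procs_before: int, procs_after: int) -> float:
    """Largest wall-time slowdown at which the smaller run costs no more CPU hours."""
    return procs_before / procs_after


# Hypothetical wall-clock times, chosen only to illustrate the arithmetic.
large = cpu_hours(256, 1.0)  # 256 processes for 1 hour
small = cpu_hours(32, 5.0)   # 32 processes for 5 hours

print(break_even_slowdown(256, 32))  # slowdown factor at which both runs cost the same
print(small < large)                 # whether the smaller run is cheaper in CPU hours
```

With these made-up numbers the smaller run costs fewer CPU hours even though it runs five times longer. Real timings would need to be measured.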

Linpack benchmark

This Linpack benchmark run illustrates why increasing numbers of HPC sites believe it no longer represents a useful real-world measure of system performance. Unlike the complicated real world of molecular dynamics, it’s possible to achieve extremely high speed-up factors with linear algebra. Here, we see that reflected in the 92% of time spent in application code.

So what can we learn from this code?

Well, despite being highly optimized, this code still spends almost half its time waiting for main memory and only a quarter of its time using the maximum-bandwidth SSE/SSE2/AVX instructions.

Vendor-specific implementations of HPL often achieve higher values here, and they’ve done so by carefully examining the key loops and reordering or in some cases replacing parts of them with hand-written assembly.

Compiler auto-vectorization isn’t nearly as good as people generally like to believe!

Different results when running the same code on the same machine

In the MPI section we see that all the time is being spent in collective calls, rather than in point-to-point calls as in the previous report, and the effective transfer rate is 0 bytes/s. This suggests many processes are waiting for long periods of time (at collective barriers, for example). Why could that be? A severe imbalance in the workload?

The memory section holds another clue. The mean per-process memory usage is almost four times lower than the peak per-process usage.

This suggests the majority of the processes are working on substantially smaller data sets than others. Did we make a mistake configuring the input file for the run? A review of the input settings (helpfully captured in the “Notes” field) shows the block sizes are set up for 64 processes. HPL has happily computed on only 64 of our 256 allocated processes, silently leaving the others idling in MPI_Finalize.
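The memory clue can be turned into a rough estimate. As a minimal sketch, assume that active processes each use roughly the peak amount of memory and idle processes use almost none; the mean-to-peak ratio then approximates the fraction of processes doing real work. The second function is a pre-run guard based on the fact that HPL distributes work over a P x Q process grid. All function names and figures here are hypothetical, chosen to match the roughly four-to-one ratio described above.

```python
def estimate_active_processes(mean_mem: float, peak_mem: float, total_procs: int) -> int:
    """Estimate how many processes hold data, assuming idle ones use almost no memory."""
    if peak_mem <= 0:
        raise ValueError("peak_mem must be positive")
    return round(total_procs * mean_mem / peak_mem)


def grid_uses_all_processes(p: int, q: int, total_procs: int) -> bool:
    """True when a P x Q process grid covers every allocated MPI process."""
    return p * q == total_procs


# Hypothetical per-process memory figures in MB, not taken from the report.
active = estimate_active_processes(mean_mem=250.0, peak_mem=1000.0, total_procs=256)
print(active)

# A hypothetical 8 x 8 grid launched on 256 processes leaves 192 of them idle.
print(grid_uses_all_processes(8, 8, 256))
```

A check like the second one, run before job submission, would catch this class of mistake without needing a profile at all.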

Simply looking at the time of the run and the output from HPL could have led us to believe that we’d reached the scaling limits or to blame this on “system networking issues”.

By looking inside the application with an Arm Performance Report, we can rapidly deduce the true failure mode: user error!

The only thing more complex than a mature HPC code is its build system

There are a lot of opportunities for poorly optimized code to make its way into a production executable, from mis-compiled libraries to runtime bounds checking. In this example, we see how one poorly compiled module affects a performance report.

The summary instantly shows something has changed, and in the CPU breakdown we can see the effect of the compiler flags directly.

Although only one module of the code was compiled incorrectly, the amount of time spent in vectorized instructions has dropped by two thirds and the amount of time spent in memory accesses has increased to 69% of the total.

As Arm Performance Reports are generated in both text and HTML formats, many people choose to run them as part of their regular regression tests during code development.

It’s very easy for a code change to accidentally make a key loop unvectorizable or to have unforeseen implications for cache behavior, but the metrics in an Arm Performance Report flag any such changes immediately.
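A regression check of this kind might be sketched as follows. This is an assumption-laden illustration, not the tool's own interface: it assumes a simplified line format such as "Vectorized: 25.0%", which may differ from the layout of a real text report, and the metric names, values, and 5-point tolerance are hypothetical.

```python
import re

# Assumed, simplified line format such as "Vectorized: 25.0%".
# Real report files may be laid out differently; adjust the pattern to match.
METRIC_LINE = re.compile(r"^\s*([A-Za-z][A-Za-z /-]*?):\s*([0-9]+(?:\.[0-9]+)?)%")


def parse_metrics(report_text: str) -> dict:
    """Collect 'name: NN.N%' pairs from report text into a dictionary."""
    metrics = {}
    for line in report_text.splitlines():
        match = METRIC_LINE.match(line)
        if match:
            metrics[match.group(1).strip()] = float(match.group(2))
    return metrics


def regressions(baseline: dict, current: dict, tolerance: float = 5.0) -> dict:
    """Return metrics that moved by more than `tolerance` percentage points."""
    changed = {}
    for name, old in baseline.items():
        new = current.get(name)
        if new is not None and abs(new - old) > tolerance:
            changed[name] = (old, new)
    return changed


# Hypothetical report excerpts, not real tool output.
baseline_text = "Vectorized: 24.0%\nMemory accesses: 48.0%\n"
current_text = "Vectorized: 8.0%\nMemory accesses: 69.0%\n"

flagged = regressions(parse_metrics(baseline_text), parse_metrics(current_text))
for name, (old, new) in flagged.items():
    print(f"{name}: {old}% -> {new}%")
```

In a test suite, a non-empty result from `regressions` could fail the build, prompting a closer look with a source-level profiler.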

How do your applications match the hardware they’re running on? Are they configured optimally?