Arm Performance Reports

About I/O behavior with Arm Performance Reports

Many HPC workloads have an increasingly significant I/O component. However, the performance of I/O systems varies widely between clusters and over time, so the ability to rapidly identify and diagnose I/O bottlenecks after a system is installed is essential.
This section demonstrates how Arm Performance Reports shows the I/O behavior of the MADbench2 benchmark, which is based on real-world Cosmic Microwave Background (CMB) out-of-core processing workloads.

Workload: low process load

In this example, a high specification system runs a low process load.

System configuration:

  • I/O processes: Four
  • System cores: 12
  • Hard drive: HDD

The I/O breakdown section shows that one-third of the time is spent reading, at around 1 Gb/s, which is a high I/O figure for a single server.

The Memory section shows that only 14% of the total memory is in use by the application. Each process only uses a few hundred megabytes of memory which means that the disk cache can hold all the data being read with a large amount of unused cache space remaining.
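The cache argument above can be checked with back-of-envelope arithmetic. The totals below are assumed for illustration only (the report states just the 14% figure and "a few hundred megabytes" per process), so this is a sketch rather than report output:

```python
# Back-of-envelope check: if the application uses only ~14% of RAM,
# the OS page cache can absorb the entire working set being read.
# The node RAM total is an assumed value, not taken from the report.

total_ram_mb = 48 * 1024          # assumed node memory: 48 GB
app_usage_mb = 0.14 * total_ram_mb
per_process_mb = 300              # "a few hundred megabytes" per process
nprocs = 4

cache_available_mb = total_ram_mb - app_usage_mb
working_set_mb = per_process_mb * nprocs

# With these numbers, the cache holds all the data being read,
# with a large amount of unused cache space remaining.
fits_in_cache = working_set_mb < cache_available_mb
```

Under any plausible node memory size, the same conclusion holds: the working set is an order of magnitude smaller than the free cache, which is why reads run at disk-cache speed.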

Two-thirds of the time is spent in write operations, which proceed at around 340 Mb/s, a fairly typical rate. This is a good level of performance for a single node, although many HPC clusters with networked file systems can achieve better results.

The next example presents a challenge to the system. 

Workload: medium process load

In this example, MPI communication occupies a substantial amount of the processing time.

System configuration:

  • I/O processes: Nine
  • System cores: 12
  • Hard drive: HDD

The MPI breakdown shows that all the processing time is occupied by collective calls, with a very low transfer rate of around 40 bytes per second. Low rates such as these usually indicate that MPI_Barrier calls are causing many processes to wait in an idle state, which points to a workload imbalance.

The I/O breakdown provides some information about the cause of the imbalance. The amount of time spent in writes has increased, and the effective write rate has dropped to just 70 Mb/s, around a fifth of the rate in the previous example. The code in this example runs a heavier load than in the previous example, and the cause is most likely heavy contention for the file system. Contention affects many I/O-heavy HPC workloads, and it is not always clear in advance which communication patterns cause it. Arm Performance Reports flags contention issues quickly.

Running a smaller group of MPI processes to perform disk writes improves the code on this system and resolves the contention.
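One way to apply this fix is to funnel all writes through a small subset of ranks. The sketch below shows only the rank-to-writer mapping, in plain Python; in real MPI code this mapping would drive an MPI_Comm_split and a gather before the write. The function name and grouping scheme are illustrative assumptions, not Arm Performance Reports or MPI APIs:

```python
# Sketch: map every MPI rank to a designated "writer" rank, so that only
# a subset of processes touch the file system. Ranks are divided into
# n_writers contiguous groups; the lowest rank in each group acts as the
# I/O aggregator for its group.

def writer_for_rank(rank: int, nprocs: int, n_writers: int) -> int:
    """Return the rank that performs disk writes on behalf of `rank`."""
    group_size = -(-nprocs // n_writers)  # ceiling division
    return (rank // group_size) * group_size

# Example: the 9 ranks of the medium-load run funnelled through 3 writers.
mapping = {r: writer_for_rank(r, 9, 3) for r in range(9)}
writers = sorted(set(mapping.values()))
```

With three writers instead of nine, each writer streams a larger contiguous chunk to disk, which reduces seek contention on a spinning HDD.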

Workload: high process load

With seven more processes than in the previous example, taking the total to 16, the system begins to struggle with performance. The 16 processes now exceed the 12 physical cores, although hyperthreading presents them as 24 logical cores.

System configuration:

  • I/O processes: 16
  • System cores: 12
  • Hard drive: HDD

At this load, less than 5% of the total duration of the application run is spent computing and the rest is spent writing to disk or waiting for other processes to finish writing to disk.

As in the previous example, the collective MPI transfer rate indicates load imbalance, and the I/O operations show that heavy write performance dominates the application run. Write speed drops to 7 Mb/s with 16 processes, compared to the much faster rate of 340 Mb/s with 4 processes in the first example.

The impact on read performance is relatively small, which is typical during high contention, because data reads can be served from cache. Data writes must eventually reach the disk, and can do so in inefficient orderings.

Note: The per-process read rate has dropped to 270 Mb/s with 16 processes, from 1 Gb/s with 4 processes. This suggests that system resources are pushed to their maximum limit at an aggregate read rate of around 4 Gb/s from the disk cache, which is a reasonable result under these circumstances.
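The aggregate figure in the note follows directly from the per-process rates. A quick check, using the Mb/s values quoted above:

```python
# Aggregate transfer rate from the per-process figures quoted in the text.

def aggregate_rate(per_process_mb_s: float, nprocs: int) -> float:
    """Total rate across all processes, in Mb/s."""
    return per_process_mb_s * nprocs

low_load = aggregate_rate(1000, 4)    # 4 processes at ~1 Gb/s each
high_load = aggregate_rate(270, 16)   # 16 processes at 270 Mb/s each

# Both runs sit near the same ~4 Gb/s ceiling, consistent with the disk
# cache being pushed to its maximum aggregate read rate.
```

That the two very different process counts converge on almost the same aggregate is what identifies the ~4 Gb/s figure as a system limit rather than an application property.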

Workload: other system specifications

System behavior can vary widely depending on the type of hardware, and different hardware requires different solutions to achieve optimum performance. In the following example, the system uses an SSD, which presents fewer load-balance issues than an HDD.

System configuration:

  • I/O processes: Four
  • System cores: Four
  • Hard drive: SSD

In this example, the code is the same as in the first example (Workload: low process load), but runs on a laptop with fewer cores than in the previous examples and an SSD.

The MPI breakdown shows that, unlike on the high-contention HDD-based servers in the earlier examples, load imbalance has much less of an impact on this system. The failure mode encountered on an HDD does not apply when writing to an SSD, because all writes have the same low access time. In contrast, high-contention writing to a spinning HDD is typically dominated by access times and the associated delays.

The consumer-grade SSD is saturated: the per-process write rate of 88 Mb/s translates to an aggregate of around 350 Mb/s for the device.

The recommended approach to improving performance on this system is to add more nodes and spread the I/O across them. In this scenario, the improvement budget is best spent on high-bandwidth I/O, not on a faster CPU.