Exploring hyperthreading with Arm Performance Reports

A processor with hyperthreading enabled appears to have twice as many cores as usual, but those cores don’t behave in the same way as real cores. Let’s see what really happens to a code using hyperthreaded cores.

Simple 1-D wave equation solver running on a 16-core machine (32 cores with hyperthreading)

This 8-process run managed 639000 iterations. There are clear opportunities to improve it: 74% of time spent in memory accesses and 0% vectorized instructions suggest that some time spent profiling could improve single-core performance by 5-10x relatively quickly.

Running with 16 processes follows the classic HPC advice to avoid hyperthreading

In this example, we can see that the CPU breakdown is still very similar to that of the 8-process run, a sign that the per-core characteristics haven’t changed. We also see another inefficiency in this code: the MPI section shows a poor point-to-point transfer rate! There could be a mistake somewhere in the code, such as assigning too few points to each process or communicating them one byte at a time.

Despite the increased communication overhead, this run achieved 1069300 iterations - around 67% more than with 8 processes. Throwing more hardware at the problem has worked so far. But can we get more out of this node by using the hyperthreaded cores?

If you weren’t aware that this node had hyperthreading enabled, this is exactly how you would use it - with one thread per logical core

However, now the solver is overwhelmingly MPI-bound. The effective point-to-point transfer rate has dropped to just 168k/s, which is unlikely to be caused by changes in network topology because everything is running on a single node.

Instead, what we’re seeing here is one of the problems with hyperthreading: the cores are not running simultaneously, so when rank A tries to communicate with rank B, rank B isn’t even running on a core but is waiting to be scheduled. This causes a lot of synchronization delays, slashing the already abysmal effective transfer rate to new lows.

As a result, the performance of the run has actually dropped to just 710900 iterations. So should we always turn hyperthreading off?

Not so fast. Something magical has happened in the CPU breakdown! Now 41.5% of the time is spent on numeric operations and only 58.5% on memory accesses. That’s a lot better!

This is hyperthreading’s promise in action – allowing the CPU to make better use of its floating-point units by switching out threads waiting due to memory latency. Can we capitalize on this?

It seems crazy, right? We have 16 real cores (32 logical cores with hyperthreading), so why would we submit a job with 24 processes per node? Isn’t that asking for trouble? Let’s look at the results!

Something very surprising has happened: compared with the 16-process run, the amount of time spent in MPI calls has actually decreased, despite adding more processes! And the amount of time spent in numeric operations is still higher than in our 8- and 16-process baselines, thanks to hyperthreading continuing to hide some of this code’s memory latency issues.

The reduced MPI time is also interesting: the Performance Report shows that the effective point-to-point transfer rate was actually slightly higher in this run than in any other.

A quick look with Arm MAP shows that the processes at the two ends of the 1-D domain only communicate in one direction, which means they spend more time waiting than the rest.

Increasing the number of processes reduces the effect this has on the mean value.

Altogether this run managed an astonishing 1226400 iterations – an increase of 15% on the same hardware without changing a line of code.

How do your applications match the hardware they’re running on? Are they configured optimally?