The High Performance Conjugate Gradient (HPCG) benchmark is designed to complement the High Performance LINPACK (HPL) benchmark in order to better characterize HPC systems. While HPL focuses on dense floating-point operations as its metric, HPCG stresses memory access patterns. HPCG centers on sparse matrix-vector multiplication, solving a linear system of equations with the conjugate gradient method. This method is widely used in the scientific computing community, particularly in computational fluid dynamics.
URL: http://www.hpcg-benchmark.org
Source URL: https://gitlab.com/arm-hpc/benchmarks/hpcg
Mirrored from: https://github.com/hpcg-benchmark/hpcg.git
Categories: open-source, benchmark
HPCG 3.1
Build Procedure
Clone the repository
The HPCG repository can be cloned from the Arm GitLab mirror:
git clone https://gitlab.com/arm-hpc/benchmarks/hpcg HPCG
cd HPCG
Configure
The setup directory contains a number of configuration files. It is recommended that you copy an existing setup file and then modify the Makefile rules and flags (e.g. MPI, OpenMP, and any compiler flags).
The files Make.GCC_OMP, Make.Linux_MPI, Make.Linux_Serial, and Make.MPI_GCC_OMP should work for systems with default library installations, but they may not have optimized compiler flags. To use a different compiler, such as the Arm Compiler for Linux, create setup/Make.MPI_Arm_OMP based on setup/Make.MPI_GCC_OMP:
cp setup/Make.MPI_GCC_OMP setup/Make.MPI_Arm_OMP
Then set CXXFLAGS to:
CXXFLAGS = $(HPCG_DEFS) -O3 -mcpu=native -ffp-contract=fast -fsimdmath -fopenmp
Make a directory for the build and configure:
mkdir build
cd build
../configure MPI_Arm_OMP
Build
Run make in the build directory:
make -j
Test
By default, HPCG uses the settings specified in bin/hpcg.dat, which sets the problem size to 104^3. For testing purposes, a smaller problem size can be specified on the command line. It is important to set the number of OpenMP threads through the environment variable OMP_NUM_THREADS. For example, on 16 cores with one OpenMP thread per MPI task:
OMP_NUM_THREADS=1 mpirun -np 16 ./bin/xhpcg 32 24 16
Optimization
By using different build directories, it is easy to track how various optimizations affect the performance of HPCG. The figure of merit is GFLOP/s. This is reported after each run in a unique file named HPCG-Benchmark_{Version}_{Date}_{Time}.txt.
Compiler Flags
The first step in optimizing HPCG is adjusting the compiler flags in the configuration file. For example, using GCC 11.2 in serial mode, HPCG converges correctly with:
CXXFLAGS = $(HPCG_DEFS) -fomit-frame-pointer -Ofast -mcpu=native -funroll-loops
Modifying source code
HPCG encourages modifying the source code. For each numerical routine there are two files: a reference version, which should not be changed, and a second version that can be modified as much as desired to achieve better performance. For example, the dot product is computed in ComputeDotProduct.cpp and ComputeDotProduct_ref.cpp. If no optimizations are made, ComputeDotProduct.cpp simply calls the reference routine:
int ComputeDotProduct(const local_int_t n, const Vector & x, const Vector & y,
    double & result, double & time_allreduce, bool & isOptimized) {
  // This line and the next two lines should be removed and your version of ComputeDotProduct should be used.
  isOptimized = false;
  return ComputeDotProduct_ref(n, x, y, result, time_allreduce);
}
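As a sketch of the kind of change HPCG invites, an OpenMP reduction could replace the call to the reference routine. The Vector struct below is a simplified stand-in for HPCG's actual type (assumed here to hold a length and a values array); the signature mirrors the real routine, but this is illustrative, not HPCG's code:

```cpp
#include <cassert>
#include <vector>

// Simplified stand-ins for HPCG's types (assumption: the real Vector
// carries a local length and an array of values).
typedef int local_int_t;
struct Vector {
    local_int_t localLength;
    std::vector<double> values;
};

// A hand-optimized dot product: OpenMP parallel reduction over the
// local vector entries.
int ComputeDotProduct(const local_int_t n, const Vector& x, const Vector& y,
                      double& result, double& time_allreduce, bool& isOptimized) {
    assert(x.localLength >= n && y.localLength >= n);
    double local_result = 0.0;
    #pragma omp parallel for reduction(+ : local_result)
    for (local_int_t i = 0; i < n; ++i)
        local_result += x.values[i] * y.values[i];
    // In the MPI build, an MPI_Allreduce on local_result would go here,
    // with its duration accumulated into time_allreduce.
    result = local_result;
    time_allreduce = 0.0;
    isOptimized = true;   // report that the optimized path was taken
    return 0;
}
```

Setting isOptimized to true tells the HPCG harness to report the run as using an optimized kernel rather than the reference implementation.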
Daniel Ruiz from Arm wrote two blog posts describing how HPCG can be optimized for Arm HPC systems, located here and here.
The source code with all of Daniel's modifications, based on HPCG version 3.0, can be cloned here.
Adding Performance Libraries
Because HPCG solves a matrix-based problem, there are ample opportunities to replace naive operations with optimized library calls. The HPCG_for_Arm repository described above has options for using the Arm Performance Libraries for most basic linear algebra operations.
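As a sketch of how that might look in the configuration file, the Arm Compiler for Linux can enable the Arm Performance Libraries with the -armpl flag, after which BLAS calls such as cblas_ddot can replace hand-written loops. The exact flags depend on your toolchain version, so treat this fragment as an assumption to verify against your compiler's documentation:

```
# Hypothetical setup/Make.MPI_Arm_OMP fragment: -armpl (Arm Compiler for
# Linux) makes the Arm Performance Libraries available at compile and link time.
CXXFLAGS = $(HPCG_DEFS) -O3 -mcpu=native -ffp-contract=fast -fopenmp -armpl
```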