Profile your application for performance issues

This video demonstrates how to:

  • Prepare an application for profiling

  • Identify workload imbalances with Arm MAP

  • Optimize the code with parallel I/O functions

Application used in this video

The sample application used in this video multiplies two matrices and adds the product to a third:

C = A x B + C

A, B and C are double-precision square matrices. 

The algorithm is parallelized with MPI and runs on eight processes: one master process and seven worker processes. It proceeds in four steps:

  • The master process initializes matrices A, B and C.

  • The master process sends the entire matrix B, along with slices of A and C, to the worker processes.

  • The master and worker processes each run the matrix multiplication on their own slices, so every process computes one slice of C.

  • The master process collects all slices of C and reassembles them into the full matrix C.
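
The four steps above can be sketched in a few lines of Python. This is only a serial illustration of the decomposition, not the code of the sample application: it uses no MPI calls, it assumes the slices are contiguous blocks of rows, and the names NUM_PROCS, split_rows, multiply_slice and distributed_multiply are hypothetical.

```python
# Serial sketch of the master/worker decomposition. No MPI is used here;
# each "process" is just one iteration over a row slice.
# Assumption: slices are contiguous blocks of rows of A and C.

NUM_PROCS = 8  # 1 master + 7 workers, as in the text (hypothetical name)


def split_rows(matrix, parts):
    """Split a matrix (a list of rows) into `parts` contiguous row slices."""
    base, extra = divmod(len(matrix), parts)
    slices, start = [], 0
    for p in range(parts):
        # The first `extra` slices take one additional row each.
        stop = start + base + (1 if p < extra else 0)
        slices.append(matrix[start:stop])
        start = stop
    return slices


def multiply_slice(a_slice, b, c_slice):
    """Return c_slice + a_slice x b for the rows owned by one process."""
    inner, cols = len(b), len(b[0])
    return [
        [c_row[j] + sum(a_row[k] * b[k][j] for k in range(inner))
         for j in range(cols)]
        for a_row, c_row in zip(a_slice, c_slice)
    ]


def distributed_multiply(a, b, c, num_procs=NUM_PROCS):
    # Steps 1 and 2: the master holds A, B and C, and hands out row slices
    # of A and C. Every process receives the whole of B.
    a_slices = split_rows(a, num_procs)
    c_slices = split_rows(c, num_procs)
    # Step 3: every process, master included, computes its slice of C.
    partial = [multiply_slice(a_s, b, c_s)
               for a_s, c_s in zip(a_slices, c_slices)]
    # Step 4: the master concatenates the slices back into the full C.
    return [row for part in partial for row in part]


if __name__ == "__main__":
    n = 16
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    c = [[1.0] * n for _ in range(n)]
    result = distributed_multiply(a, b, c)
    print(result[0][:3])
```

Each process needs all of B because every row of the result depends on every column of B, while it depends only on the matching rows of A and C. In an MPI program, handing out and collecting the slices would typically be done with point-to-point or collective communication calls rather than list slicing.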

Figure: The matrix multiplication process, showing the master process (green) and worker processes 1 to n (orange and blue).