Debugging OpenACC Codes with Arm DDT

OpenACC is a directive-based programming model for offloading C, C++, and Fortran code to attached accelerators. Its directive syntax resembles OpenMP, and the compiler and runtime take care of accelerator initialization and much of the data transfer, so the developer does not have to write that code by hand. For GPUs in particular, the model greatly simplifies porting existing programs or writing new ones.

Even with simplified development models, difficult bugs will still occur. Fortunately, Arm DDT lets developers debug OpenACC codes with the same ease as CPU-only codes. With DDT you can, for example, step through accelerator code, set breakpoints, and use the multi-dimensional array viewer to inspect data on the device.

This tutorial shows how to modify a simple matrix multiplication function to use a GPU with OpenACC, and how to use Arm DDT to inspect this code.

Matrix multiplication with OpenACC 

The following image shows the matrix multiplication function with OpenACC pragmas added to enable use of the GPU. The pragmas specify how data is transferred to and from the GPU, define the work that can be done in parallel, and describe the type of parallelism present in kernels. Rewriting this algorithm in native CUDA could require dozens of new lines and a deep understanding of the GPU architecture.

Modifying a single matrix multiplication function
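As a rough textual counterpart to the figure, the sketch below shows one way such a function might look. It is not the code from the original image: the function name matmul_acc, the flattened row-major layout, and the choice of clauses are all assumptions made for illustration. Only the loop indices i, j, and k are taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example, not the original tutorial code.
 * Computes c = a * b for n x n matrices stored in row-major order. */
void matmul_acc(int n, const float *restrict a, const float *restrict b,
                float *restrict c)
{
    assert(n > 0 && a != NULL && b != NULL && c != NULL);

    /* Data region: copy a and b to the device on entry,
     * copy c back to the host on exit. */
    #pragma acc data copyin(a[0:n*n], b[0:n*n]) copyout(c[0:n*n])
    {
        /* The i and j iterations are independent, so they are
         * collapsed into one parallel iteration space. */
        #pragma acc parallel loop collapse(2)
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                float sum = 0.0f;
                /* Each thread runs its own k loop sequentially. */
                #pragma acc loop seq
                for (int k = 0; k < n; k++) {
                    sum += a[i * n + k] * b[k * n + j];
                }
                c[i * n + j] = sum;
            }
        }
    }
}
```

A compiler without OpenACC support ignores the pragmas and runs the loops on the host, which gives a convenient CPU reference when comparing results. To debug the accelerator version in DDT, build with an OpenACC-capable compiler and debug information enabled; the exact flags depend on the compiler.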

Programming for an accelerator introduces a new class of bugs:

  • Data must be transferred to and from the device.  OpenACC automates and hides much of this, but problems can still occur.  Use the DDT array viewer to check that the data is the same on the host and on the device.
  • Accelerators often have a different memory design, and less memory available. Use the DDT Memory Usage dialog box to spot memory usage issues.
  • Accelerators typically use many more threads than a CPU, and they operate these threads differently. Use the DDT Stacks view to understand thread behavior exactly.

OpenACC program loaded in Arm DDT

In the figure below, a breakpoint has been set in the accelerator code, and the blue shaded lines show the locations of the threads. The Stacks view shows where all of the threads are in the execution path. You can set further breakpoints in the same way as on the CPU, and DDT behaves in the same manner. DDT adds the CUDA Threads widget to the top of the GUI. Changing the block or thread value selects the thread under observation; you can confirm this by checking the i, j, and k values in the Locals view on the right.

OpenACC code loaded using Arm DDT
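The block and thread values in the CUDA Threads widget reflect how the compiler maps OpenACC parallelism onto the GPU. As a hedged illustration, NVIDIA's compilers typically map gangs to CUDA thread blocks and vector lanes to threads within a block, though other implementations may choose differently. The function name saxpy_acc and the vector length of 128 below are assumptions chosen for the sketch, not values from the tutorial.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y = alpha * x + y with explicit gang/vector mapping. */
void saxpy_acc(size_t n, float alpha, const float *restrict x,
               float *restrict y)
{
    assert(x != NULL && y != NULL);

    /* gang   : iterations are distributed across gangs
     * vector : iterations within a gang are spread over vector lanes
     * vector_length(128) requests 128 lanes per gang (an assumed value). */
    #pragma acc parallel loop gang vector vector_length(128) \
        copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; i++) {
        y[i] = alpha * x[i] + y[i];
    }
}
```

With a mapping like this, stepping through block and thread values in the debugger should show the loop index i changing in the Locals view, which helps confirm which iteration each thread is executing.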

Accelerator programming requires transferring data to and from the device, which is a common source of errors.  With the DDT array viewer, you can inspect data on both the host and the GPU to confirm that data movement is performed correctly.

DDT Array Viewer
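One kind of data-movement bug that a host/device comparison can expose is a missing update after device-side computation. The sketch below is an assumed example (the helper name scale_on_device is hypothetical) that uses an unstructured data region, where copying results back is the programmer's responsibility.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: scale x in place on the device,
 * then refresh the host copy. */
void scale_on_device(size_t n, float *x, float factor)
{
    assert(x != NULL);

    /* Allocate x on the device and copy the host values to it. */
    #pragma acc enter data copyin(x[0:n])

    #pragma acc parallel loop present(x[0:n])
    for (size_t i = 0; i < n; i++) {
        x[i] *= factor;
    }

    /* Copy the device values back to the host. If this line is omitted
     * on a device with separate memory, the host copy keeps the old
     * values while the device copy holds the scaled ones. */
    #pragma acc update host(x[0:n])

    /* Release the device copy without any further transfer. */
    #pragma acc exit data delete(x[0:n])
}
```

Stopping just before the update directive and viewing x on both the host and the device should show the two copies differing; after the update they should match.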

With OpenACC and Arm DDT, scientists and engineers have a powerful tool set to take advantage of accelerators.  Porting time is reduced, code complexity is minimized, and debugging is greatly simplified.