Arm Forge

Debugging and optimizing CUDA and OpenACC

Arm Forge is a tool suite for developing, debugging and optimizing CUDA and OpenACC codes that run on NVIDIA GPUs such as GeForce, Tesla and the Kepler K80. Arm Forge includes Arm DDT for parallel and multi-process debugging, and Arm MAP for profiling.

Key CUDA support features

Arm DDT features
  • Creates breakpoints in CUDA threads at specific lines of CUDA or OpenACC code.
  • Supports mixed CPU and GPU debugging, and multi-process code, in the same debugging session.
  • Displays CPU and GPU threads in Arm's thread-consolidating parallel stack view, which simplifies the information and highlights the differences.
  • Includes support for dynamic parallelism, often used for recursive CUDA code.
  • Shows all memory types, including register, shared (block) and global memory, and supports unified virtual addressing and GPUDirect.
  • Steps into and over warps, blocks and entire kernels.
  • Debugs CUDA core dumps with CUDA 7 and above.
  • Supports multiple GPUs simultaneously.
  • Provides memory debugging for access errors and memory leak reporting for global memory.

Arm MAP features
  • Provides a view of memory transfers and global memory used.
  • Provides a view of GPU temperature as the job progresses.
  • Profiles CPU code at line level.
  • Displays a view for monitoring and analyzing the time that CPU threads spend waiting for CUDA kernels to complete.

Supported compilers
  • CUDA Fortran, F90 and OpenACC compilers from The Portland Group (PGI).
  • Cray OpenACC compiler.
  • Inline PTX.
  • CUDA toolkits (including nvcc), for CUDA versions 6.5, 7, 7.5 and 8.

CUDA debugging

Arm DDT fully supports applications written in CUDA C, C++, Fortran and OpenACC, and is used to debug them in industries that rely on high performance computing.

The following sections offer best practices for using Arm DDT with CUDA.

Set a breakpoint

You can set a breakpoint at any line of CUDA source code, just as you would with a CPU debugger. When a block of CUDA threads reaches that line, the debugger pauses the whole application.

Explore behavior

GPUs rely heavily on SIMD-style execution and allow thousands of active threads to run simultaneously. The debugger lets you identify and select a specific CUDA thread by its index, or select a thread that is on a particular line of code.

Stepping a thread shows how a kernel progresses. Unlike CPUs, however, CUDA GPUs execute threads in groups called warps.

Note: When you step a thread, the other GPU threads in the same warp (usually 32 threads) progress at the same time. Check that the thread you are monitoring moves through the code as intended.
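
The warp grouping described in the note can be modeled on the host. The sketch below is plain Python, needs no GPU and is not part of Arm DDT; the function names are hypothetical. It assumes a warp size of 32 and CUDA's usual linearization of a thread index within a block (x varies fastest, then y, then z), and reports which threads would advance together when one of them is stepped.

```python
WARP_SIZE = 32  # assumption: warp size on current NVIDIA GPUs


def linear_thread_id(thread_idx, block_dim):
    """Flatten an (x, y, z) thread index within a block; x varies fastest."""
    x, y, z = thread_idx
    dim_x, dim_y, _ = block_dim
    return x + y * dim_x + z * dim_x * dim_y


def warp_peers(thread_idx, block_dim, warp_size=WARP_SIZE):
    """Return the linear ids of all threads in the same warp as thread_idx."""
    block_threads = block_dim[0] * block_dim[1] * block_dim[2]
    warp = linear_thread_id(thread_idx, block_dim) // warp_size
    first = warp * warp_size
    # The last warp of a block is partial when the block size
    # is not a multiple of the warp size.
    return range(first, min(first + warp_size, block_threads))


peers = warp_peers((8, 2, 0), (16, 16, 1))
print(f"thread (8, 2, 0) steps with linear ids {peers.start}..{peers.stop - 1}")
```

In a 16×16 block, for instance, one warp spans two rows of the block, so stepping a thread in one row also advances threads in the neighboring row.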

Visualize data

Each CUDA thread has its own register variables but can share other memory with threads in the same block, with the whole device, or even with the host. Make sure that the data resides in the expected type of memory. Single values are easy to inspect, but unusual values in arrays are easier to spot when you visualize or filter the data.

Tip: Generate a second visualization to compare and check GPU data with the CPU copy. 

Verify memory usage

Before you start your application in Arm DDT, select the option to debug CUDA memory usage. Array lengths are often not exact multiples of the warp size, so reading beyond the end of an array is a common error in CUDA. These errors are not always fatal, but they cause non-deterministic behavior that can lead to failure at unexpected times. The debugger identifies them so that you can fix them before they cause problems.
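
The overrun described above typically comes from rounding the launch size up to a whole number of thread groups. The following plain-Python model requires no GPU; `launch_shape` and `simulate_kernel` are hypothetical names, and the block size of 256 is an assumption chosen for illustration. It counts the thread indices that fall past the end of the array, with and without the usual `i < n` guard.

```python
def launch_shape(n, block_size=256):
    """Round the launch up to whole blocks, as a CUDA launch typically does."""
    blocks = (n + block_size - 1) // block_size  # ceiling division
    return blocks, blocks * block_size


def simulate_kernel(n, block_size=256, guarded=True):
    """Return the global indices that would access memory beyond an n-element array."""
    _, total_threads = launch_shape(n, block_size)
    out_of_bounds = []
    # i plays the role of blockIdx.x * blockDim.x + threadIdx.x
    for i in range(total_threads):
        if guarded and i >= n:
            continue  # a guarded kernel returns early instead of touching data[i]
        if i >= n:
            out_of_bounds.append(i)
    return out_of_bounds


blocks, threads = launch_shape(1000)
unguarded = simulate_kernel(1000, guarded=False)
print(f"{blocks} blocks, {threads} threads, {len(unguarded)} out-of-bounds accesses")
print(f"with guard: {len(simulate_kernel(1000))} out-of-bounds accesses")
```

The unguarded case is the kind of access that may appear to work on one run and fail on another, which is why it is worth catching with memory debugging enabled.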

Profiling and tuning CUDA and OpenACC applications

Use the Arm MAP profiler to examine application performance so that you can optimize the code for the best performance. For compute-intensive code, Arm MAP helps to identify the lines of host CPU code that take the most time during execution.

Step 1: Profile the initial code

Use the Arm MAP profiler to discover the parts of your code that consume the most CPU time and determine what those lines of code do. If scalar or floating-point operations dominate CPU time, that code is likely to make good use of a GPU. However, if I/O, branching or communication dominate the CPU time, fix those issues first.

Step 2: Profile the results

Once your optimized CUDA or OpenACC code is running, profile it again to establish whether performance has improved and to identify the next target for optimization. Where possible, reduce how often data is transferred between CPU and GPU by combining sequences of CPU operations into one larger unit of GPU work.
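
The effect of combining operations can be estimated with a toy cost model. The sketch below is plain Python with a hypothetical function name and arbitrary illustrative timings (not measurements, and not output from Arm MAP). It assumes every host-to-device or device-to-host copy costs the same fixed time and that transfers do not overlap with computation.

```python
def elapsed(num_ops, transfer_s, kernel_s, combined):
    """Toy cost model: each copy costs transfer_s, each kernel costs kernel_s."""
    if combined:
        # Copy in once, run every operation on the device, copy out once.
        transfers = 2
    else:
        # Copy in and out around every individual operation.
        transfers = 2 * num_ops
    return transfers * transfer_s + num_ops * kernel_s


# Illustrative values only: 5 operations, 10 ms per copy, 2 ms per kernel.
separate = elapsed(num_ops=5, transfer_s=0.010, kernel_s=0.002, combined=False)
fused = elapsed(num_ops=5, transfer_s=0.010, kernel_s=0.002, combined=True)
print(f"separate: {separate:.3f} s, combined: {fused:.3f} s")
```

In this model the kernel time is unchanged; only the number of copies falls, which is the saving a second profile should confirm on real code.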

Note: MAP can provide profile information about source lines or functions for CUDA kernels executed on the GPU, but not for OpenACC kernels.

Advanced CUDA: Overlap data transfer

MAP can collect CUDA source-line profiling information and show how the CPU and GPU work together.

One approach to optimization is to overlap GPU and CPU computation, which CUDA makes easy with streams. You still need to find out how much time is spent in data transfer and how long the CPU waits at the synchronization stage. If the CPU spends very little time waiting there, the GPU is probably finishing before the CPU, and the division of work needs further optimization. If the CPU spends too much time waiting, it is wasting cycles that could be put to use.
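
The balance described above can be reasoned about with a small model of a single overlapped step. This is a plain-Python sketch under stated assumptions, not something MAP produces: the function name is hypothetical, the CPU and GPU are assumed to start at the same moment, and `gpu_s` is assumed to include the data transfers as well as the kernel time.

```python
def sync_balance(cpu_s, gpu_s):
    """Toy model of one overlapped step: CPU and GPU start together, then synchronize."""
    cpu_wait = max(0.0, gpu_s - cpu_s)  # CPU idles at the synchronization point
    gpu_idle = max(0.0, cpu_s - gpu_s)  # GPU finished early and has no work left
    return {"elapsed": max(cpu_s, gpu_s), "cpu_wait": cpu_wait, "gpu_idle": gpu_idle}


# Illustrative timings in seconds: CPU-bound, GPU-bound and balanced steps.
for cpu_s, gpu_s in [(4.0, 1.0), (1.0, 4.0), (2.5, 2.5)]:
    print(cpu_s, gpu_s, sync_balance(cpu_s, gpu_s))
```

A step is well balanced when both `cpu_wait` and `gpu_idle` are close to zero; a large value on either side indicates which processor could take on more of the work.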