Arm DDT is able to debug applications that use NVIDIA CUDA devices, debugging the code running on the GPU at the same time as the host CPU code.
Arm supports a number of GPU compilers that target CUDA devices.
- NVCC, the NVIDIA CUDA compiler
- Cray OpenACC
- PGI CUDA Fortran and the PGI Accelerator Model
- IBM XLC/XLF with offloading support
To debug CUDA programs with Arm DDT, a CUDA-enabled license key is required; this is an optional addition to standard licenses. If your license does not include CUDA, the CUDA options are grayed out on the Arm DDT run dialog.
While debugging a CUDA program, an additional process from your license is used for each GPU. As an exception, a single-process license still allows the debugging of a single GPU.
For NVIDIA's nvcc compiler, kernels must be compiled with the -g -G flags. These flags enable the generation of debug information for the kernels and disable some optimizations that would hinder debugging. To use memory debugging in DDT with CUDA, --cudart shared must also be passed to nvcc.
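As an illustration, a simple kernel (the file and kernel names here are only examples) might be built for debugging as follows:

```cuda
// Build with device debug information, device optimizations disabled,
// and the shared CUDA runtime (required for DDT memory debugging):
//
//   nvcc -g -G --cudart shared -o saxpy saxpy.cu
//
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```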
To launch a CUDA job, tick the CUDA box on the run dialog before clicking run/submit. You may also enable memory debugging for CUDA programs from the CUDA section. See section 12.2 CUDA memory debugging for details.
Attaching to running CUDA applications is not possible if the application has already initialized the driver in some way, for example through having executed any kernel or called any functions from the CUDA library.
Controlling GPU threads is integrated with the standard Arm DDT controls, so that the usual play, pause, and breakpoints are all applicable to GPU kernels.
CUDA Breakpoints can be set in the same manner as other breakpoints in Arm DDT. See section 7.6 Setting breakpoints.
Breakpoints affect all GPU threads and cause the application to stop whenever a thread reaches the breakpoint. Where kernels have a similar workload across blocks and grids, threads tend to reach the breakpoint together and the kernel pauses once per set of scheduled blocks, that is, the set of threads that fit on the GPU at any one time.
Where kernels have divergent distributions of work across threads, timing may be such that threads within a running kernel hit a breakpoint and pause the kernel. After continuing, more threads within the currently scheduled set of blocks will hit the breakpoint and pause the application again.
To apply breakpoints to individual blocks, warps, or threads, use conditional breakpoints: for example, set a condition on the built-in variables threadIdx.x (and threadIdx.y or threadIdx.z as appropriate) for thread indices, or blockIdx.x for block indices.
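For example, conditions along the following lines (entered in the breakpoint's condition field; the particular index values are illustrative) restrict a breakpoint to one thread, one warp, or one block:

```cuda
// Stop only thread 0 of block 0:
blockIdx.x == 0 && threadIdx.x == 0

// Stop only the warp covering threads 64-95 of block 2 (32-thread warps):
blockIdx.x == 2 && threadIdx.x >= 64 && threadIdx.x < 96

// Stop every thread in block (1,2,0) of a 3D grid:
blockIdx.x == 1 && blockIdx.y == 2 && blockIdx.z == 0
```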
The GPU execution model is noticeably different from that of the host CPU. In the context of stepping operations, that is step in, step over or step out, there are critical differences to note.
The smallest execution unit on a GPU is a warp, which on current NVIDIA GPUs is 32 threads. Step operations can operate on warps but nothing smaller.
Arm DDT also makes it possible to step whole blocks, whole kernels or whole devices. The stepping mode is selected using the drop down list in the CUDA Thread Selector.
GPU execution under the control of a debugger is slower than running without one. When stepping blocks or kernels, these are serialized into warps, so stepping units larger than a warp may be slow. It is not unusual for a step operation to take 60 seconds on a large kernel, particularly on newer devices where a step may involve stepping over a function call.
It is not currently possible to "step over" or "step out" of inlined GPU functions.
GPU functions are often inlined by the compiler. This can be avoided (dependent on hardware) by specifying the __noinline__ keyword in your function declaration, and by compiling your code for a later GPU profile. For example, by adding -arch=sm_20 to your compile line.
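For example, a device function might be declared as follows (assuming a hardware profile that supports non-inlined calls):

```cuda
// Ask the compiler not to inline this device function, so that
// "step over" and "step out" behave as expected in the debugger.
// Compile with a suitable profile, for example -arch=sm_20.
__device__ __noinline__ float scale(float value, float factor)
{
    return value * factor;
}
```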
Clicking the "Play/Continue" button in DDT runs all GPU threads. It is not possible to run individual blocks, warps or threads.
Much of the user interface when working with GPUs is unchanged from regular MPI or multithreaded debugging. However, there are a number of enhancements and additional features that have been added to help understand the state of GPU applications.
These changes are summarized in the following section.
The Thread Selector allows you to select your current GPU thread. The current thread is used for the variable evaluation windows in DDT, along with the various GPU stepping operations.
The first entries represent the block index, and the subsequent entries represent the 3D thread index inside that block.
Changing the current thread updates the local variables, the evaluations, and the current line displays and source code displays to reflect the change.
The thread selector is also updated to display the current GPU thread if it changes as a result of any other operation. For example, if:
- The user changes threads by selecting an item in the Parallel Stack View.
- A memory error is detected and is attributed to a particular thread.
- The kernel has progressed, and the previously selected thread is no longer present in the device.
The GPU Thread Selector also displays the dimensions of the grid and blocks in your program.
It is only possible to inspect and control threads in the set of blocks that are actually loaded into the GPU. If you try to select a thread that is not currently loaded, a message is displayed.
The Parallel Stack View has been updated to display the location and number of GPU threads.
Clicking an item in the Parallel Stack View selects the appropriate GPU thread, updating the variable display components accordingly and moving the source code viewer to the appropriate location.
Hovering over an item in the Parallel Stack view also allows you to see which individual GPU thread ranges are at a location, as well as the size of each range.
Given a simple kernel that is to calculate an output value for each index in an array, it is not easy to check whether the value at position x in an array has been calculated, or whether the calculating thread has yet to be scheduled.
This contrasts sharply with scalar programming, where, if the counter of an incrementing loop exceeds x, the value at index x can be taken to be final. If it is difficult to decide whether array data is fresh or stale, this is clearly a major issue during debugging.
Arm DDT includes a component that makes this easy, the Kernel Progress display, which appears at the bottom of the user interface by default when a kernel is in progress.
This view identifies the kernels that are in progress. Kernels are counted and grouped across processes by kernel identifier, which is the kernel name.
A colored progress bar is used to identify which GPU threads are in progress. The progress bar is a projection onto a straight line of the (potentially) 6-dimensional GPU block and thread indexing system and is tailored to the sizes of the kernels operating in the application.
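The exact projection DDT uses is internal to the tool, but one natural scheme, shown here only as a sketch, flattens the three block indices and three thread indices into a single number in row-major order:

```cuda
// A possible projection of the 6-dimensional (blockIdx, threadIdx)
// space onto a single line: number the blocks in row-major order,
// then number the threads within each block the same way.
// This is an illustration, not DDT's actual mapping.
__device__ unsigned long long linear_position()
{
    unsigned long long block =
        ((unsigned long long)blockIdx.z * gridDim.y + blockIdx.y)
            * gridDim.x + blockIdx.x;
    unsigned long long thread =
        ((unsigned long long)threadIdx.z * blockDim.y + threadIdx.y)
            * blockDim.x + threadIdx.x;
    unsigned long long threadsPerBlock =
        (unsigned long long)blockDim.x * blockDim.y * blockDim.z;
    return block * threadsPerBlock + thread;
}
```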
Clicking within the color-highlighted sections of this progress bar selects the GPU thread that most closely matches the click location. The selected GPU thread is colored blue; of the deselected GPU threads, those that are scheduled are colored green and the unscheduled ones are white.
The source code viewer allows you to visualize the program flow through your source code by highlighting lines in the current stack trace. When debugging GPU kernels, it will color highlight lines with GPU threads present and display the GPU threads in a similar manner to that of regular CPU threads and processes. Hovering over a highlighted line in the code viewer will display a summary of the GPU threads on that line.
One of the challenges of GPU programming is in discovering device parameters, such as the number of registers or the device type, and whether a device is present.
In order to assist in this, Arm DDT includes a GPU Devices display. This display examines the GPUs that are present and in use across an application, and groups the information together scalably for multi-process systems.
Attaching to a running GPU application and then debugging the GPU threads is only supported for Fermi class cards and their successors. This includes Tesla C2050/2070, K10, and K20.
To attach to a running job, please see the section 5.9 Attaching to running programs and select the Debug CUDA button on the attach window.
In CUDA 7.0, NVIDIA introduced support for GPU code to generate core files. These can be opened in DDT in exactly the same way as core files generated by CPU code. See 5.8 Opening core files for details.
CUDA allows debugging of multiple CUDA processes on the same node. However, each process will still attempt to reserve all of the available GPUs for debugging.
This works for the case where a single process debugs all GPUs on a node, but not for multiple processes debugging a single GPU.
A temporary workaround when using Open MPI is to export the following environment variable before starting DDT:
This will assign a single device (based on local rank) to each process. In addition:
- You must select Open MPI (Compatibility), not Open MPI, in File → Options (Arm Forge → Preferences on Mac OS X).
- The device selected for each process will be the only device visible when enumerating GPUs. This causes manual GPU selection code to stop working (because device IDs change, and so on).
- DDT supports version 8.0 onwards of the NVIDIA CUDA toolkit. In all cases, the most recent CUDA toolkit and driver versions are recommended.
- X11 cannot be running on any GPU used for debugging. (Any GPU running X11 will be excluded from device enumeration.)
- You must compile with -g -G to enable GPU debugging; otherwise, your program will run through the contents of kernels without stopping.
- Debugging 32-bit CUDA code on a 64-bit host system is not supported.
- It is not yet possible to detect unsuccessful kernel launches or failures. An error code is provided by cudaGetLastError() in the CUDA runtime API, which you can call in your code to detect this. Currently the debugger cannot check this error code without resetting it, which is not desirable behavior.
- Device memory allocated via cudaMalloc() is not visible outside of the kernel function.
- Not all illegal program behavior can be caught in the debugger, for example, divide-by-zero.
- Device allocations larger than 100 MB on Tesla GPUs, and larger than 32 MB on Fermi GPUs, may not be accessible.
- Breakpoints in divergent code may not behave as expected.
- Debugging applications with multiple CUDA contexts running on the same GPU is not supported.
- If the CUDA environment variable CUDA_VISIBLE_DEVICES=<index> is used to target a particular GPU, make sure no X server is running on any of the GPUs. Note also that any GPU running X will be excluded from enumeration, which may affect the device IDs.
- The CUDA driver requires that applications be debugged in a mode matching their version. If your system is running with a toolkit version lower than the CUDA driver version, force DDT to use the correct CUDA version by setting the ALLINEA_FORCE_CUDA_VERSION environment variable. For example, if you are using the CUDA 8.0 driver, set ALLINEA_FORCE_CUDA_VERSION=8.0. Alternatively, consider upgrading your CUDA toolkit to match the CUDA driver.
- If memory debugging and CUDA support are enabled in Arm DDT then only threaded memory preloads are available.
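A minimal host-side check for a failed launch, of the kind the list above refers to, looks like this (the kernel name and launch configuration are placeholders):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(void) { /* placeholder kernel */ }

int main(void)
{
    myKernel<<<1, 32>>>();

    // cudaGetLastError() returns the error from the most recent launch
    // and resets it, which is why the debugger cannot inspect it
    // without side effects.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
    return 0;
}
```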
For GPUs that have SM type less than sm_20 (or when code is compiled targeting SM type less than sm_20), the following issues may apply.
- GPU code targeting less than SM type sm_20 will inline all function calls. This can lead to behavior such as not being able to step over/out of subroutines.
- Debugging applications using textures is not supported on GPUs with SM type less than sm_20.
- If you are debugging code in device functions that get called by multiple kernels, then setting a breakpoint in the device function will insert the breakpoint in only one of the kernels.
In addition to the native nvcc compiler, a number of other compilers are supported.
Cray OpenACC is fully supported by Arm DDT. Code pragmas are highlighted, most variables are visible within the device, and stepping and breakpoints in the GPU code are supported. The compiler flag -g is required to enable device (GPU-based) debugging; -O0 should not be used, as it disables use of the GPU and runs the accelerated regions on the CPU.
You should be aware of the following known issues:
- It is not possible to track GPU allocations created by the Cray OpenACC compiler as it does not directly call cudaMalloc.
- Pointers in accelerator code cannot be dereferenced in CCE 8.0.
- Memory consumption in debugging mode can be considerably higher than in regular mode. If issues with memory exhaustion arise, consider using the environment variable CRAY_ACC_MALLOC_HEAPSIZE to set the total heap size (in bytes) used on the device, which can make more memory available to the application.
Arm DDT supports debugging both the host and CUDA parts of PGI Accelerator and CUDA Fortran programs compiled with version 14.4 or later of the PGI compiler. Older versions of the PGI compiler support debugging only on the host.
Arm DDT supports debugging both the host and CUDA parts of OpenMP programs making use of offloading when compiled with version 13.1.7 or later of the IBM XLC/XLF compilers.
For the best debugging experience with offloaded OpenMP regions, the following compiler flags are recommended: -g -O0 -qsmp=omp:noopt -qoffload -qfullpath -qnoinline -Xptxas -O0 -Xllvm2ptx -nvvm-compile-options=-opt=0.