CUDA debugging
Arm DDT fully supports applications written in CUDA C, C++, Fortran and OpenACC. It is used to debug these applications in industries that rely on high performance computing.
The following sections offer best practices on using Arm DDT for CUDA.
Set a breakpoint
You can set a breakpoint at any line of CUDA source code, in the same way that you can with a CPU debugger. When a block of CUDA threads reaches that line, the debugger pauses the whole application.
Explore behavior
GPUs rely heavily on SIMD execution and allow thousands of threads to be active simultaneously. The debugger enables you to identify and select a specific CUDA thread by its index, or to select a thread that is on a particular line of code.

Stepping a thread shows how a kernel progresses. Unlike CPUs, however, CUDA GPUs execute threads in groups called warps.
Note: Other GPU threads in the same warp (usually 32 threads) progress at the same time. Check that the thread you are monitoring moves through the code as intended.
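The following is a minimal sketch of a kernel that is convenient for practicing thread selection and stepping. It is an illustration only: the names `scale`, `warp_of`, `lane_of` and `kWarpSize` are hypothetical, and the warp size of 32 is an assumption that holds for current NVIDIA GPUs. Building with `nvcc -g -G` includes host and device debug information.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Assumption: a warp holds 32 threads, as on current NVIDIA GPUs.
constexpr int kWarpSize = 32;

// Hypothetical helpers: which warp and lane a thread occupies in its block.
__host__ __device__ inline int warp_of(int thread_in_block) { return thread_in_block / kWarpSize; }
__host__ __device__ inline int lane_of(int thread_in_block) { return thread_in_block % kWarpSize; }

__global__ void scale(float *data, int n)
{
    // Global index of this thread: the value you would select by in the debugger.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Divergent branch: lanes of the same warp take different paths.
    if (lane_of(threadIdx.x) % 2 == 0) {
        data[i] *= 2.0f;   // even lanes
    } else {
        data[i] += 1.0f;   // odd lanes
    }
}

int main()
{
    const int n = 64;      // two warps in a single block
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    scale<<<1, n>>>(d_data, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

With a breakpoint on the `if` inside `scale`, selecting thread 33 of block 0 puts you in warp 1, lane 1. Stepping from there should show the odd-lane path, while the remaining threads of warp 1 advance alongside it.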
Visualize data
Each CUDA thread has its own register variables, but can share other memory with threads in the same block, with the whole device, or even with the host. Make sure that the data resides in the expected type of memory. Single values are easy to inspect, but to spot unusual values in an array it helps to visualize or filter the data.
Tip: Generate a second visualization to compare and check GPU data with the CPU copy.
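As a rough illustration of the memory types mentioned above, the sketch below uses a register variable, a per-block shared array and global device memory in one kernel, then compares the GPU result with a CPU reference. The names `reverse_block`, `count_mismatches` and `kBlock` are hypothetical and are not part of Arm DDT or CUDA.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int kBlock = 64;   // threads per block, and elements per array

__global__ void reverse_block(const int *in, int *out)
{
    __shared__ int tile[kBlock];   // shared memory: visible to every thread in the block
    int t = threadIdx.x;           // register variable: private to this thread

    tile[t] = in[t];               // in and out point to global device memory
    __syncthreads();               // wait until the whole tile has been written
    out[t] = tile[kBlock - 1 - t];
}

// Host-side check: number of elements where the GPU copy differs from the CPU copy.
int count_mismatches(const int *gpu, const int *cpu, int n)
{
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (gpu[i] != cpu[i]) ++mismatches;
    }
    return mismatches;
}

int main()
{
    int h_in[kBlock], h_ref[kBlock], h_out[kBlock];
    for (int i = 0; i < kBlock; ++i) {
        h_in[i] = i;
        h_ref[i] = kBlock - 1 - i;   // CPU reference: the reversed input
    }

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, sizeof(h_in));
    cudaMalloc(&d_out, sizeof(h_out));
    cudaMemcpy(d_in, h_in, sizeof(h_in), cudaMemcpyHostToDevice);

    reverse_block<<<1, kBlock>>>(d_in, d_out);
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

    int mismatches = count_mismatches(h_out, h_ref, kBlock);
    std::printf("mismatches: %d\n", mismatches);

    cudaFree(d_in);
    cudaFree(d_out);
    return mismatches == 0 ? 0 : 1;
}
```

In a debugging session, `tile` and `out` are natural candidates for the array visualization, with `h_ref` serving as the CPU copy to compare against.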
Verify memory usage
Before you start your application in Arm DDT, select the option to debug CUDA memory usage. Array sizes are not always a multiple of the warp size, so a common CUDA error is to read beyond the end of an array. These errors are not always fatal, but they cause non-deterministic behavior that can lead to failures at unexpected times. The debugger identifies these out-of-bounds accesses so that you can fix them.
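The sketch below shows the kind of error described above, under assumed values: an array of 1000 elements processed by blocks of 128 threads, so the launch covers 1024 threads. The names `add_one` and `blocks_for` are hypothetical. Without the `i < n` guard, the last 24 threads would access memory beyond the end of the array.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Number of blocks needed to cover n elements, rounding up.
__host__ __device__ inline int blocks_for(int n, int block_size)
{
    return (n + block_size - 1) / block_size;
}

__global__ void add_one(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Guard: the grid is rounded up, so some threads have no element to process.
    // Removing this check makes those threads access memory past the array.
    if (i < n) {
        data[i] += 1.0f;
    }
}

int main()
{
    const int n = 1000;          // assumed size, not a multiple of 32 or 128
    const int block_size = 128;
    const int blocks = blocks_for(n, block_size);   // 8 blocks, 1024 threads

    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    add_one<<<blocks, block_size>>>(d_data, n);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_data);
    return err == cudaSuccess ? 0 : 1;
}
```

Rounding the grid up and guarding inside the kernel is one common way to handle sizes that do not divide evenly; it is shown here as an example, not as the only approach.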