
30 GPU profiling

When profiling applications that use CUDA 8.0 and above, GPU kernels that can be tracked by NVIDIA's CUDA Profiling Tools Interface (CUPTI) will be displayed in a new "GPU Kernels" tab.


Figure 116: GPU Kernels View

This lists the CUDA kernels that were detected in the program, alongside graphs indicating when those kernels were active. If multiple kernels were identified in a process within a particular sample, they are given equal weighting in this graph.

Note that:

  • CUDA kernels generated by OpenACC, CUDA Fortran, or offloaded OpenMP regions are not yet supported by MAP.
  • GPU profiling is only supported with CUDA 8.0 and above.
  • GPU profiling is not supported if the CUDA driver and toolkit versions do not match (for example, profiling a CUDA 8.0 program with a CUDA 9.0 driver is not supported).
  • GPU profiling is not supported when statically linking the MAP sampler library.

30.1 Kernel analysis

CUDA kernel analysis mode is an advanced feature that provides insight into the activity within CUDA kernels. This mode can be enabled from the MAP run dialog or from the command line with --cuda-kernel-analysis.
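As a sketch, enabling this mode from the command line might look as follows (the program name here is a placeholder):

```shell
# Profile non-interactively with CUDA kernel analysis enabled.
# "./my_cuda_program" is a placeholder for your own executable.
map --profile --cuda-kernel-analysis ./my_cuda_program
```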


Figure 117: Run window with CUDA kernel analysis enabled

When enabled, the "GPU Kernels" tab is enhanced to show a line-level breakdown of warp stalls. The possible warp stall reasons are those listed in the enum CUpti_ActivityPCSamplingStallReason in the CUPTI API documentation:

Selected
No stall, instruction is selected for issue.
Instruction fetch
Warp is blocked because next instruction is not yet available, because of an instruction cache miss, or because of branching effects.
Execution dependency
Instruction is waiting on an arithmetic dependency.
Memory dependency
Warp is blocked because it is waiting for a memory access to complete.
Texture sub-system
Texture sub-system is fully utilized or has too many outstanding requests.
Thread or memory barrier
Warp is blocked as it is waiting at __syncthreads or at a memory barrier.
__constant__ memory
Warp is blocked waiting for __constant__ memory and immediate memory access to complete.
Pipe busy
Compute operation cannot be performed due to required resource not being available.
Memory throttle
Warp is blocked because there are too many pending memory operations.
Not selected
Warp was ready to issue, but some other warp issued instead.
Other
Miscellaneous stall reason.
Dropped samples
Samples dropped (not collected) by hardware due to backpressure or overflow.
Unknown
The stall reason could not be determined. Used when CUDA kernel analysis has not been enabled (see above) or when an internal error occurred within CUPTI or MAP.


Figure 118: GPU kernels view (with CUDA kernel analysis)

Note that warp stalls are only reported per-kernel, so it is not possible to obtain the times within a kernel invocation at which different categories of warp stalls occurred. As function calls in CUDA kernels are automatically fully inlined, it is not possible to see a stack trace of code within a kernel on the GPU.
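The following kernel is an illustrative example only (not taken from the MAP manual) of how warp stalls map onto source lines when kernel analysis is enabled; the names are hypothetical:

```cuda
// Illustrative only: sums each row of a matrix. With CUDA kernel analysis
// enabled, MAP attributes warp stall samples to these source lines.
__global__ void sum_rows(const float *in, float *out, int ncols)
{
    extern __shared__ float partial[];  // dynamic shared memory
    int row = blockIdx.x;
    float acc = 0.0f;

    // Strided global loads: stalls here typically show as "Memory dependency".
    for (int c = threadIdx.x; c < ncols; c += blockDim.x)
        acc += in[row * ncols + c];

    partial[threadIdx.x] = acc;
    __syncthreads();  // Stalls here show as "Thread or memory barrier".

    if (threadIdx.x == 0) {
        float total = 0.0f;
        for (int i = 0; i < blockDim.x; ++i)
            total += partial[i];
        out[row] = total;
    }
}
```

Because any helper functions called from such a kernel are fully inlined, all stall samples are attributed to lines of the kernel itself.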

Warp stall information is also present in the code editor (section 18.3), the selected line view (section 19.2), and in a warp stall reason graph in the metrics view (section 24).

30.2 Compilation

When compiling CUDA kernels, do not generate debug information for device code (the -G or --device-debug flag), as this can significantly impair runtime performance. Use -lineinfo instead, for example:

    nvcc device.cu -c -o device.o -g -lineinfo -O3
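The same applies to every compilation step of a multi-file program; a hypothetical build (file names are placeholders) might be:

```shell
# Compile each translation unit with -lineinfo (not -G), then link.
# All file names here are hypothetical.
nvcc main.cu   -c -o main.o   -g -lineinfo -O3
nvcc device.cu -c -o device.o -g -lineinfo -O3
nvcc main.o device.o -o my_cuda_program
```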

30.3 Performance impact

Enabling CUPTI sampling impacts the target program in the following ways:

  1. A short amount of time will be spent post-processing at the end of each kernel. This will depend on the length of the kernel and the CUPTI sampling frequency.
  2. Kernels will be serialized. Each CUDA kernel invocation will not return until the kernel has finished and CUPTI post-processing has been performed. Without CUDA kernel analysis mode kernel invocation calls return immediately to allow CUDA processing to be performed in the background.
  3. Increased memory usage whilst in a CUDA kernel. This may manifest as fluctuations between two memory usage values, depending on whether a sample was taken during a CUDA kernel or not.

Taken together, the above may have a significant impact on the target program, potentially resulting in an orders-of-magnitude slowdown. To mitigate this, profile and analyse CUDA kernels (with --cuda-kernel-analysis) and non-CUDA code (without --cuda-kernel-analysis) in separate profiling sessions.

The NVIDIA GPU metrics will be adversely affected by this overhead, particularly the "GPU utilization" metric. See section 24.7.

When profiling CUDA code, it may be useful to profile only a short subsection of the program, so that time is not wasted waiting for CUDA kernels you do not need data for. See section 16.3.8 for instructions on how to do this.

30.4 Customizing GPU profiling behavior

The interval at which CUPTI samples GPU warps can be modified with the environment variable ALLINEA_SAMPLER_GPU_INTERVAL. Accepted values are max, high, mid, low, and min, with the default value being high. These correspond to the values in the enum CUpti_ActivityPCSamplingPeriod in the CUPTI API documentation.
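For example, a session using a shorter sampling period than the default could be launched as follows (the program name is a placeholder):

```shell
# Sample GPU warps more frequently than the default "high" period.
# "./my_cuda_program" is a placeholder for your own executable.
export ALLINEA_SAMPLER_GPU_INTERVAL=mid
map --profile --cuda-kernel-analysis ./my_cuda_program
```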

Reducing the sampling interval means warp samples are taken more frequently. While this may be needed for very short-lived kernels, setting the interval too low can result in a very large number of warp samples being taken which then require significant post-processing time once the kernel completes. Overheads of twice as long as the kernel's normal runtime have been observed. It is recommended that the CUPTI sampling interval is not reduced.

30.5 Known issues

  • GPU profiling is only supported using CUDA 8.0 and above.
  • GPU profiling is not supported if the CUDA driver and toolkit versions do not match (for example, profiling a CUDA 8 program with a CUDA 9 driver is not supported).
  • When preparing your program for profiling, it is advised to match the version of the CUDA toolkit to that of the CUDA driver.
  • CUPTI allocates a small amount of host memory each time a kernel is launched. If your program launches many kernels in a tight loop this overhead can skew the memory usage figures.
  • CUDA kernels generated by OpenACC, CUDA Fortran or offloaded OpenMP regions are not yet supported by MAP.
  • The graphs are scaled on the assumption that there is a 1:1 relationship between processes and GPUs, each process having exclusive use of its own CUDA card. The graphs may be of an unexpected height if some processes do not have a GPU, or if multiple processes share the use of a common GPU.
  • Enabling CUDA kernel analysis mode can have a significant performance impact, as described in section 30.3.
  • GPU profiling is not supported when statically linking the MAP sampler library.