The Selected Lines View is currently available only for profiles generated on x86_64 systems.
The Selected Lines View gives you detailed information on how one or more lines of code are spending their time.
To access this view, open one of your program's source files in the code viewer and highlight a line.
The Selected Lines View, which is shown by default on the right-hand side of the source view, automatically updates to show a detailed breakdown of how the selected lines are spending their time.
You can select multiple lines, and MAP will show information for all of the lines together.
The panel is divided into two sections.
The first section gives an overview of how much time was spent executing instructions on this line, and how much time was spent in other functions.
If the time spent executing instructions is low, consider using the Stacks View or the Functions View to locate functions that are using a lot of CPU time. For more information on the Stacks View see section . For more information on the Functions View see section .
The second section details the CPU instruction metrics for the selected line.
Unlike the global program metrics, the line metrics are divided into separate entries for scalar and vector operations, and report time spent in "implicit memory accesses".
On some architectures, computational instructions (such as integer or vector operations) are allowed to access memory implicitly. When these types of instruction are used, MAP cannot distinguish between time performing the operation and time accessing memory, and therefore reports time for the instruction in both the computational category and the memory category.
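As a rough illustration of an implicit memory access, consider the loop below. The function name `dot_product` is hypothetical and is not taken from MAP. On x86_64, a compiler may fold the load of `b[i]` into the arithmetic instruction itself (for example, a multiply that takes a memory operand) instead of emitting a separate load. The exact instructions depend on the compiler and its flags, so treat this as a sketch of the kind of line where time could be reported in both the computational and the memory category.

```c
#include <stddef.h>

/* Hypothetical example: a dot product over two arrays.
 * The line inside the loop both computes (multiply, add) and reads memory.
 * A compiler targeting x86_64 may emit an arithmetic instruction with a
 * memory operand for a[i] * b[i], which is an implicit memory access. */
double dot_product(const double *a, const double *b, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

If the compiler instead emits a separate load followed by a register-only multiply, the load would count as an explicit memory access.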
The amount of time spent in "explicit" and "implicit" memory accesses is reported as a footnote to the time spent executing instructions.
Some guidelines are listed here:
- In general, aim for a large proportion of time in vector operations.
- If you see a high proportion of time in scalar operations, check whether your compiler is optimizing correctly for your processor's SIMD instructions.
- If you see a large amount of time in memory operations, look for ways to access memory more efficiently to improve cache performance.
- If you see a large amount of time in branch operations then look for ways to avoid using conditional logic in your inner loops.
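The guidelines above can be sketched with a small example. Both functions below are hypothetical and compute the same result: negative values are replaced by zero. The second form expresses the condition as a value selection, accesses memory with unit stride, and uses `restrict` (C99) to tell the compiler that the arrays do not overlap. These properties often make it easier for a compiler to generate vector instructions without a branch in the loop body, but this is an assumption about typical compiler behavior, and a good optimizer may produce identical code for both.

```c
#include <stddef.h>

/* Hypothetical example: conditional logic written as an if/else
 * inside the inner loop. */
void clamp_branchy(const double *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (in[i] < 0.0) {
            out[i] = 0.0;
        } else {
            out[i] = in[i];
        }
    }
}

/* Same result written as a select over contiguous, non-overlapping arrays.
 * Compilers can often turn this into a vector max or blend operation. */
void clamp_select(const double *restrict in, double *restrict out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        out[i] = (in[i] < 0.0) ? 0.0 : in[i];
    }
}
```

After a change of this kind, profile again and compare the scalar, vector, and branch proportions reported for the loop body to see whether it had any effect.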
Modern superscalar processors use instruction-level parallelism to decode and execute multiple operations in a single cycle, if internal CPU resources are free, and will retire multiple instructions at once, making it appear as if the program counter "jumps" several instructions per cycle.
Current architectures do not allow profilers such as MAP (or Intel VTune, Linux perftools and others) to efficiently measure which instructions were "invisibly" executed by this instruction-level parallelism. This time is typically allocated to the last instruction executed in the cycle.
Most MAP users will not be affected by this for the following reasons:
- Hot lines in an HPC code typically contain rather more than a single instruction such as nop. This makes it unlikely that an entire source line will be executed invisibly via the CPU's instruction-level parallelism.
- Any such lines executed "for free" in parallel with another line by a CPU core will clearly show up as a "gap" in the source code view (but this is unusual).
- Loops with stalls and mispredicted branches still show up, with the line containing the problem highlighted, in all but the most extreme cases.
To summarize key points:
- Expert users: those wanting to use MAP's per-line instruction metrics to investigate detailed CPU performance of a loop or kernel (even down to the assembly level) should be aware that instructions executed in parallel by the CPU will show up with time only assigned to the last one in the batch executed.
- Other users: MAP's statistical instruction-based metrics correlate well with where time is spent in the application and help to find areas for optimization. Feel free to use them as such. If you see lines with very few operations on them (such as a single add or multiply) and no time assigned to them inside your hot loops then these are probably being executed "for free" by the CPU using instruction-level parallelism. The time for each such batch is assigned to the last instruction completed in the cycle instead.
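The attribution effect described above can be pictured with a toy model. This is not how MAP is implemented. The function `attribute_cycles` and its inputs are hypothetical and exist only to illustrate the idea: each instruction is given the cycle in which it completes, instructions sharing a cycle form a batch, and all the time since the previous batch is credited to the last instruction of the batch.

```c
#include <stddef.h>

/* Toy model only (an assumption for illustration, not MAP's algorithm).
 * retire_cycle[i] is the cycle in which instruction i completes; values
 * must be non-decreasing and execution is assumed to start at cycle 0.
 * cycles_out[i] receives the time credited to instruction i. */
void attribute_cycles(const int *retire_cycle, size_t n, int *cycles_out)
{
    int previous_batch_cycle = 0;
    for (size_t i = 0; i < n; ++i) {
        /* An instruction is last in its batch if the next one completes
         * in a different cycle, or if there is no next instruction. */
        int is_last_in_batch =
            (i + 1 == n) || (retire_cycle[i + 1] != retire_cycle[i]);
        if (is_last_in_batch) {
            cycles_out[i] = retire_cycle[i] - previous_batch_cycle;
            previous_batch_cycle = retire_cycle[i];
        } else {
            cycles_out[i] = 0; /* executed "for free" in this model */
        }
    }
}
```

For example, with completion cycles `{1, 1, 1, 2, 5, 5}` the model credits `{0, 0, 1, 1, 0, 3}`: the first two instructions show no time, and the three-cycle stall before cycle 5 is charged to the final instruction of that batch.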
When CUDA kernel analysis is enabled (see section ) and the selected line is executed on the GPU, this view shows a breakdown of warp stall reasons for that line. For a description of each warp stall reason, refer to the tooltip for each entry or to section .