Arm MAP provides code viewing, editing and rebuilding features. It also integrates with most major version control systems and provides static analysis to automatically detect many classes of common errors.
The code editing and rebuilding capabilities are not designed for developing applications from scratch, but they are designed to fit into existing profiling sessions that are running on a current executable.
Source and header files found in the executable are reconciled with the files present on the front-end server, and displayed in a simple tree view within the Project Files tab of the Project Navigator window. Source files can be loaded for viewing by clicking on the file name.
The source code viewer supports automatic color syntax highlighting for C and Fortran.
You can hide functions or subroutines you are not interested in by clicking the '-' glyph next to the first line of the function. This will collapse the function. Simply click the '+' glyph to expand the function again.
The centre pane shows your source code, annotated with performance information. All the charts you will see in MAP share a common horizontal time axis. The start of your job is at the left and the end at the right. The sparkline charts next to each line of source code show how the number of cores executing that line of code varies over time.
What does it mean to say a core is executing a particular line of code? In the source code view, MAP uses inclusive time, that is, time spent on this line of code or inside functions called by this line. So the main() function of a single-threaded C or MPI program is typically at 100% for the entire run.
Only 'interesting' lines get charts, that is, lines in which at least 0.1% of the selected time range was spent. In the previous figure you can see that three different lines meet this criterion. The other lines were executed as well, but a negligible amount of time was spent on them.
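The idea of inclusive time and the 0.1% display threshold can be illustrated with a minimal sketch. It assumes profiling data is available as a list of sampled call stacks. This is a hypothetical representation chosen for illustration, not a format that MAP exposes, and the function names and sample values are invented.

```python
from collections import Counter

# Hypothetical sampled call stacks, outermost frame first.
# Each frame is (function name, line number).
samples = [
    [("main", 10), ("imbalance", 42)],
    [("main", 10), ("imbalance", 43)],
    [("main", 12), ("stride", 70)],
    [("main", 12), ("stride", 70)],
    [("main", 14)],
]

def inclusive_percentages(stacks):
    """Percentage of samples in which each frame appears anywhere on the stack."""
    counts = Counter()
    for stack in stacks:
        # Count a frame once per sample, even if recursion repeats it.
        for frame in set(stack):
            counts[frame] += 1
    total = len(stacks)
    return {frame: 100.0 * n / total for frame, n in counts.items()}

def interesting_lines(percentages, threshold=0.1):
    """Keep only entries at or above the display threshold (0.1% in the text)."""
    return {k: v for k, v in percentages.items() if v >= threshold}

line_pct = inclusive_percentages(samples)
# The same helper works per function if the line numbers are dropped.
function_pct = inclusive_percentages([[name for name, _ in s] for s in samples])

print(line_pct[("main", 10)])   # 40.0: the call on line 10 or anything it called
print(function_pct["main"])     # 100.0: main() is on every sampled stack
```

Because every sample contains main, its inclusive time is 100%, which matches the behaviour described above for a single-threaded C or MPI program.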
The first line is a function call to imbalance, which was running for 18.1% of the wall-clock time. If you look closely, you will see that as well as a large block of green there is a sawtooth pattern in blue. Color is used to identify different kinds of time. In a single-threaded MPI code the following colors are used:
- Dark green Single-threaded computation time. For an MPI program, this is all computation time. For an OpenMP or multi-threaded program, this is the time the main thread was active and no worker threads were active.
- Blue MPI communication and waiting time. All time spent inside MPI calls is blue, regardless of whether that is in MPI_Send or MPI_Barrier. Typically you want to minimize this, because the purpose of most codes is parallel computation, not communication for its own sake.
- Orange I/O time. All time spent inside known I/O functions such as reading and writing to the local or networked filesystem is shown in orange. You definitely want to minimize time spent in I/O and on many systems the complex data storage hierarchy can cause unexpected bottlenecks to occur when scaling a code up. MAP always shows the time from the application's point of view, so all the underlying complexity is captured and represented as simply as possible.
- Dark purple Accelerator. All the time the CPU spends waiting for the accelerator to return control to the CPU. Typically you want to minimize this by using asynchronous accelerator calls, so that the CPU works in parallel with the accelerator.
In the above screenshot you can see the following:
- First a function called imbalance is called. This function spends most of its time in computation (dark green) and around 15-20% of it in MPI calls (blue). Hovering the mouse over any graph shows an exact breakdown of the time spent in it. There is a sawtooth pattern to the time spent in MPI calls that will be investigated later.
- Next the application moves on to a function called stride, which spends all of its time computing. You will see how to tell whether this time is well spent or not. You can also see an MPI synchronization at the end. The triangle shape is typical of ranks finishing their work at different times and spending varying amounts of time waiting at a barrier. Triangles in these charts indicate imbalance.
- Finally, a function called overlap is called, which spends almost all of its time in MPI calls.
- The other functions in this snippet of source code were active for <0.1% of the total runtime and can be ignored from a profiling point of view.
As this was an MPI program, the height of each block of color represents the percentage of MPI processes that were running each particular line at any moment in time. So the sawtooth pattern of MPI usage actually tells us that:
- The imbalance function goes through several iterations.
- In each iteration all processes start out computing, so there is more green than blue.
- As execution continues more and more processes finish computing and transition to waiting in an MPI call, causing the distinctive triangular pattern showing workload imbalance.
- As each triangle ends all ranks finish communicating and the pattern begins again with the next iteration.
This is a classic sign of MPI imbalance. In fact, any triangular patterns in MAP's graphs show that first a few processes are changing to a different state of execution, then more, then more until they all synchronize and move on to another state together. These areas should be investigated.
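The way a triangle emerges from uneven workloads can be reproduced with a small model. The sketch below assumes four ranks with hypothetical per-iteration compute times and a barrier that releases when the slowest rank arrives. It is an illustration of the pattern, not a reconstruction of the imbalance function in the example.

```python
def fraction_waiting(compute_times, t):
    """Fraction of ranks that have finished computing by time t and wait in MPI."""
    return sum(1 for c in compute_times if c <= t) / len(compute_times)

def wasted_fraction(compute_times):
    """Share of total rank-time spent waiting for the slowest rank in one iteration."""
    slowest = max(compute_times)
    waiting = sum(slowest - c for c in compute_times)
    return waiting / (slowest * len(compute_times))

# Hypothetical compute time in seconds for each rank in one iteration.
compute_times = [1.0, 2.0, 3.0, 4.0]

# Height of the blue region at successive moments in the iteration.
profile = [fraction_waiting(compute_times, t) for t in (0.0, 1.0, 2.0, 3.0, 4.0)]
print(profile)                         # [0.0, 0.25, 0.5, 0.75, 1.0]
print(wasted_fraction(compute_times))  # 0.375
```

The steadily rising profile is the triangle: ranks move into the waiting state one after another until all of them have arrived. In this model 37.5% of the available rank-time in the iteration is spent waiting.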
You can explore this situation in more detail by opening the examples/slow.map file and looking at the imbalance function yourself. Can you see why some processes take longer to finish computing than others?
In an OpenMP or multi-threaded program (or a mixed-mode MPI+OpenMP program) you will also see these colors used:
- Light green Multi-threaded computation time. For an OpenMP program this is time inside OpenMP regions. When profiling an OpenMP program you want to see as much light green as possible, because that is the only time you are using all available cores. Time spent in dark green is a potential bottleneck because it is serial code outside an OpenMP region.
- Light blue Multi-threaded MPI communication time. This is MPI time spent waiting for MPI communication while inside an OpenMP region or on a pthread. As with the normal blue MPI time you will want to minimize this, but also maximize the amount of multi-threaded computation (light green) that is occurring on the other threads while this MPI communication is taking place.
- Dark Gray Time inside an OpenMP region in which a core is idle or waiting to synchronize with the other OpenMP threads. In theory, during an OpenMP region all threads are active all of the time. In practice there are significant synchronization overheads involved in setting up parallel regions and synchronizing at barriers. These will be seen as dark gray holes in the otherwise happy light green of optimal parallel computation. If you see these there may be an opportunity to improve performance with better loop scheduling or division of the work to be done.
- Pale blue Thread synchronization time. Time spent waiting for synchronization between non-OpenMP threads (for example, a pthread_join). Whether this time can be reduced depends on the purpose of the threads in question.
In the screenshot above you can see that 11.1% of the time is spent calling neighbor.build(atom) and 78.4% of the time is spent calling force->compute(atom, neighbor, comm, comm.me). The graphs show a mixture of light green indicating an OpenMP region and dark gray indicating OpenMP overhead. OpenMP overhead is the time spent in OpenMP that is not the contents of an OpenMP region (user code). Hovering the mouse over a line will show the exact percentage of time spent in overhead, but visually you can already see that it is significant but not dominant here.
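To see why loop scheduling can affect the amount of dark gray, consider a rough model of a parallel loop whose iterations have unequal cost. The sketch below compares contiguous equal-sized chunks (in the spirit of OpenMP schedule(static)) with handing each iteration to whichever thread becomes free first (in the spirit of schedule(dynamic, 1)). The iteration costs are hypothetical, and the model ignores the runtime cost of dynamic scheduling itself, so it is only an approximation of idle time at the closing barrier.

```python
import heapq

def static_schedule(costs, n_threads):
    """Per-thread busy time when iterations are split into contiguous equal-sized chunks."""
    chunk = -(-len(costs) // n_threads)  # ceiling division
    return [sum(costs[i * chunk:(i + 1) * chunk]) for i in range(n_threads)]

def dynamic_schedule(costs, n_threads):
    """Per-thread busy time when each iteration goes to the thread that is free first."""
    busy = [0.0] * n_threads
    heapq.heapify(busy)
    for cost in costs:
        earliest = heapq.heappop(busy)
        heapq.heappush(busy, earliest + cost)
    return sorted(busy)

def idle_fraction(per_thread):
    """Share of thread-time spent idle while waiting for the slowest thread."""
    span = max(per_thread)
    return sum(span - t for t in per_thread) / (span * len(per_thread))

# Hypothetical loop in which later iterations cost more than earlier ones.
costs = [float(i) for i in range(1, 9)]

static_idle = idle_fraction(static_schedule(costs, 4))
dynamic_idle = idle_fraction(dynamic_schedule(costs, 4))
print(static_idle)   # 0.4
print(dynamic_idle)  # 0.25
```

In this model the static split leaves threads idle for 40% of the region and the dynamic split for 25%. Whether a real code benefits depends on the actual cost distribution and on the scheduling overhead that this sketch leaves out.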
Increasingly, programs use both MPI and OpenMP to parallelize their workloads efficiently. MAP fully and transparently supports this model of working. It is important to note that the graphs are a reflection of the application activity over time:
- A large section of blue in a mixed-mode MPI code means that all the processes in the application were inside MPI calls during this period. Try to reduce these, especially if they have a triangular shape suggesting that some processes were waiting inside MPI while others were still computing.
- A large section of dark green means that all the processes were running single-threaded computations during that period. Avoid this in an MPI+OpenMP code, or you might as well leave out the OpenMP sections altogether.
- Ideally you want to achieve large sections of light green, showing OpenMP regions being effectively used across all processes simultaneously.
- It is possible to call MPI functions from within an OpenMP region. MAP only supports this if the main thread (the OpenMP master thread) is the one that makes the MPI calls. In this case, the blue block of MPI time will be smaller, reflecting that one OpenMP thread is in an MPI function while the rest are doing something else such as useful computation.
In a program using NVIDIA CUDA, CPU time spent waiting for GPU kernels to complete is shown in purple.
When CUDA kernel analysis mode is enabled (see Section 30), MAP also displays data for lines inside CUDA kernels. These graphs show when GPU kernels were active, and for each kernel a breakdown of the different types of warp stalls that occurred on that line. The different types of warp stalls are listed in Section 30.1. Refer to the tooltip or selected line display (Section 19.2) to get the exact breakdown, but in general:
- Purple Selected. Instructions on this line were being executed on the GPU.
- Dark Purple Not selected. This means warps on this line were ready to execute but that there was no available SM to do the executing.
- Red (various shades) Memory operations. Warps on this line were stalled waiting for some memory dependency to be satisfied. Shade of red indicates the type of memory operation.
- Blue (various shades) Execution dependency. Warps on this line were stalled until some other action completes. Shade of blue indicates the type of execution dependency.
Note that warp stalls are only reported per-kernel, so it is not possible to obtain the times within a kernel invocation at which different categories of warp stalls occurred. Because function calls in CUDA kernels are also automatically fully inlined, it is not possible to see warp stalls for 'time spent inside function(s) on line' for GPU kernel code.
In this screenshot a CUDA kernel involving this line was running for 13.1% of the time, with most of the warps waiting for a memory access to complete. The colored horizontal range indicates when any kernel observed to be using this source line was on the GPU. The height of the colored region indicates the proportion of sampled warps that were observed to be on this line. See the NVIDIA CUPTI documentation at http://docs.nvidia.com/cuda/cupti/r_main.html#r_pc_sampling for more information on how warps are sampled.
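The relationship between sampled warps, the height of the colored region, and the per-line stall breakdown can be sketched as a simple aggregation. The example assumes the samples for one kernel are available as (source line, stall category) pairs. Both this representation and the category labels are hypothetical stand-ins for illustration, not the CUPTI or MAP data format.

```python
from collections import Counter, defaultdict

# Hypothetical warp samples for one kernel: (source line, stall category).
warp_samples = [
    (57, "memory"),
    (57, "memory"),
    (57, "selected"),
    (60, "execution_dependency"),
]

def line_breakdown(samples):
    """Per line: share of all sampled warps, and the split of categories on that line."""
    per_line = defaultdict(Counter)
    for line, category in samples:
        per_line[line][category] += 1
    total = len(samples)
    result = {}
    for line, categories in per_line.items():
        on_line = sum(categories.values())
        result[line] = {
            "share_of_warps": on_line / total,
            "categories": {c: n / on_line for c, n in categories.items()},
        }
    return result

breakdown = line_breakdown(warp_samples)
print(breakdown[57]["share_of_warps"])  # 0.75
```

Here three of the four sampled warps were on line 57, and two thirds of those were stalled on memory. Because the samples are aggregated over the whole kernel, the result carries no information about when within the kernel each stall occurred.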
Real-world scientific codes do not look much like the examples above. They tend to look more like the following:
Here, small amounts of processing are distributed over many lines, and it is difficult to see which parts of the program are responsible for the majority of the resource usage.
To understand the performance of complex blocks of code like this, MAP supports code folding. Each logical block of code such as an if-statement or a function call has a small [-] next to it. Clicking this folds those lines of code into one and shows one single sparkline for the entire block:
Now you can clearly see that most of the processing occurs within the conditional block starting on line 122.
When exploring a new source file, a good way to understand its performance is to use the View->Fold All menu item to collapse all the functions in the file to single lines, then scroll through it looking for functions that take an unusual amount of time or show an unusual pattern of I/O or MPI overhead. These can then be expanded to show their most basic blocks, and the largest of these can be expanded again and so on.
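One way to picture what folding does to the charts is as a sum of the per-line series over the lines in the block. The sketch below assumes each line has a series giving the number of processes on that line in each time bucket, and that the lines belong to the same non-recursive function, so a process is on at most one of them at a time. The line numbers and values are hypothetical, and this is an illustrative model rather than a description of how MAP computes folded sparklines.

```python
def fold(per_line_series, first_line, last_line):
    """Combine the series for lines first_line..last_line into one series."""
    block = [
        series
        for line, series in per_line_series.items()
        if first_line <= line <= last_line
    ]
    if not block:
        return []
    # Add the values bucket by bucket across all lines in the block.
    return [sum(values) for values in zip(*block)]

# Hypothetical data: processes (out of 8) on each line in three time buckets.
per_line_series = {
    122: [1, 2, 1],
    123: [3, 4, 5],
    124: [0, 0, 1],
    130: [4, 2, 1],
}

folded = fold(per_line_series, 122, 124)
print(folded)  # [4, 6, 7]
```

Small contributions spread over lines 122 to 124 add up to most of the 8 processes in later buckets, which is the kind of pattern that is hard to see until the block is folded.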
Source code may be edited in the code viewer windows of MAP. The actions Undo, Redo, Cut, Copy, Paste, Select all, Go to line, Find, Find next, Find previous, and Find in files are available from the Edit menu.
Files may be opened, saved, reverted and closed from the File menu.
Note that information from MAP will not match edited source files until the changes are saved, the binary is rebuilt, and a new profile is created.
If the currently selected file has an associated header or source code file, it can be opened by right-clicking in the editor and choosing Open <filename>.<extension>. There is a global shortcut on function key F4, available in the Edit menu as the Switch Header/Source option.
To edit a source file in an external editor, right-click the editor for the file and choose Open in external editor. To change the editor used, or if the file does not open with the default settings, open the Options window by selecting File → Options (Arm Forge → Preferences on Mac OS X) and enter the path to the preferred editor in the Editor box, for example /usr/bin/gedit.
If a file is edited the following warning is displayed at the top of the editor.
To configure the build command choose File → Configure Build…, enter a build command and a directory in which to run the command, and click Apply.
Changes to source files may be committed using Git, Mercurial, or Subversion. To commit changes choose File → Commit…, enter a commit message in the resulting dialog and click the commit button.