
24 Metrics View

This section describes how the metrics view works together with the source code, stacks, and project files views to help you identify and understand performance problems.

Figure 113: Metrics view

The horizontal axis is wall clock time. By default, three metric graphs are shown. The top-most is the Main thread activity chart, which uses the same colors and scales as the per-line sparkline graphs described in section 18. To understand the Main thread activity chart, read that section first.

For CUDA programs profiled with CUDA kernel analysis mode enabled, a "warp stall reasons" graph is also displayed. This shows the warp stalls for all CUDA kernels detected in the program, using the same colors and scales as the GPU kernel graphs described in section 30.1. To understand the warp stall reasons graph, read that section first.

All of the other metric graphs show how single numerical measurements vary across processes and time. Initially, two frequently used ones are shown: CPU floating-point and memory usage. However, there are many other metric graphs available, and they can all be read in the same way. Each vertical slice of a graph shows the distribution of values across processes for that moment in time. The minimum and maximum are clear, and shading is used to display the mean and standard deviation of the distribution.

A thin line means all processes had very similar values. A 'fat' shaded region means there is significant imbalance between the processes. Extra details about each moment in time appear below the metric graphs as you move the mouse over them.
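
To make the shading concrete, the sketch below (an illustration only, not MAP source code) computes the summary statistics that one vertical slice of a metric graph represents, given one sampled value per process at a single moment in time.

    /* Illustrative sketch: summarize one vertical slice of a metric graph.
     * values[] holds one sampled metric value per process at one moment. */
    #include <math.h>
    #include <stdio.h>

    static void summarize_slice(const double *values, int nprocs)
    {
        double min = values[0], max = values[0], sum = 0.0;
        for (int i = 0; i < nprocs; i++) {
            if (values[i] < min) min = values[i];
            if (values[i] > max) max = values[i];
            sum += values[i];
        }
        double mean = sum / nprocs;

        double var = 0.0;
        for (int i = 0; i < nprocs; i++)
            var += (values[i] - mean) * (values[i] - mean);
        double stddev = sqrt(var / nprocs);

        /* A thin line corresponds to a small max-min spread; a 'fat' shaded
         * region corresponds to a large spread and standard deviation. */
        printf("min=%.2f max=%.2f mean=%.2f stddev=%.2f\n",
               min, max, mean, stddev);
    }

    int main(void)
    {
        /* Example: CPU floating-point percentages for 4 processes, one of
         * which is lagging behind the others. */
        double slice[] = { 42.0, 45.5, 44.0, 12.0 };
        summarize_slice(slice, 4);
        return 0;
    }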

The metrics view is at the top of the GUI as it ties all the other views together. Move your mouse across one of the graphs, and a black vertical line appears on every other graph in MAP, showing what was happening at that moment in time.

You can also click and drag to select a region of time within the metrics view. All the other views and graphs then redraw themselves to show only what happened during the selected period of time, ignoring everything else. This is a useful way to isolate interesting parts of your application's execution. To reselect the entire time range, double-click or use the Select All button.

Figure 114: MAP with a region of time selected

In the screenshot above, a short region of time has been selected around an interesting sawtooth in MPI_BARRIER, caused by delays on PE 1. The first block accepts data in strict PE order, so it is badly delayed; the second block is more flexible, accepting data from any PE, so PE 1 can compute in parallel. The code view shows how compute and communication are serialized in the first block but overlap in the second.
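
The following minimal MPI sketch (a hypothetical example, not the program from the screenshot) contrasts the two receive patterns described above: accepting data strictly in PE order, and accepting it from whichever PE is ready using MPI_ANY_SOURCE.

    /* Hypothetical sketch: rank 0 gathers one message from every other rank. */
    #include <mpi.h>
    #include <stddef.h>

    /* Pattern 1: accept data strictly in PE order. If PE 1 is late, every
     * later receive is delayed behind it and compute serializes with comms. */
    void receive_in_pe_order(double *buf, int count, int nranks)
    {
        for (int src = 1; src < nranks; src++)
            MPI_Recv(buf + (size_t)src * count, count, MPI_DOUBLE, src,
                     0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    /* Pattern 2: accept data from any PE that is ready, so a slow PE does not
     * hold up the others and they can continue computing in parallel. */
    void receive_from_any_pe(double *scratch, int count, int nranks)
    {
        for (int i = 1; i < nranks; i++) {
            MPI_Status status;
            MPI_Recv(scratch, count, MPI_DOUBLE, MPI_ANY_SOURCE,
                     0, MPI_COMM_WORLD, &status);
            /* status.MPI_SOURCE identifies which PE sent this data; a real
             * code would copy it into the right slot before reusing scratch. */
        }
    }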

There are many more metrics than those displayed by default. Click the Metrics button, or right-click on the metric graphs, to choose one of the following presets or any combination of the metrics beneath them. You can return to the default set of metrics at any time by choosing the Preset: Default option.

24.1 CPU instructions

The following sections describe the CPU instruction metrics available on each platform: x86_64, Armv8-A, IBM Power 8, and IBM Power 9 systems.

Note

Due to differences in processor models, not all metrics are available on all systems.

Tip: When you select one or more lines of code in the code view, MAP will show a breakdown of the CPU Instructions used on those lines. Section 19 describes this view in more detail.

24.1.1 CPU instruction metrics available on x86_64 systems

These metrics show the percentage of time that the active cores spent executing different classes of instruction. They are most useful for optimizing single-core and OpenMP performance.

CPU floating-point: The percentage of time each rank spends in floating-point CPU instructions. This includes vectorized instructions and standard x87 floating-point. High values here suggest CPU-bound areas of the code that are probably functioning as expected.

CPU integer: The percentage of time each rank spends in integer CPU instructions. This includes vectorized instructions and standard integer operations. High values here suggest CPU-bound areas of the code that are probably functioning as expected.

CPU memory access: The percentage of time each rank spends in memory access CPU instructions, such as move, load and store. This also includes vectorized memory access functions. High values here may indicate inefficiently-structured code. Extremely high values (98% and above) almost always indicate cache problems. Typical cache problems include cache misses due to incorrect loop orderings but may also include more subtle features such as false sharing or cache line collisions.
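
As a hypothetical illustration of a loop-ordering cache problem, the two functions below compute the same sum over a row-major C array. The column-first traversal strides through memory, misses the cache far more often, and would show up as a much higher CPU memory access percentage.

    /* Same arithmetic, very different cache behavior. */
    #define N 4096

    double sum_row_major(const double a[N][N])
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)        /* rows outer ...                 */
            for (int j = 0; j < N; j++)    /* ... columns inner: unit stride */
                s += a[i][j];
        return s;
    }

    double sum_column_major(const double a[N][N])
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)        /* columns outer ...                    */
            for (int i = 0; i < N; i++)    /* ... rows inner: stride of N doubles  */
                s += a[i][j];
        return s;
    }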

CPU floating-point vector: The percentage of time each rank spends in vectorized floating-point instructions. Optimized floating-point-based HPC code should spend most of its time running these operations. This metric provides a good check to see whether your compiler is correctly vectorizing hotspots. See section H.6 for a list of the instructions considered vectorized.

CPU integer vector: The percentage of time each rank spends in vectorized integer instructions. Optimized integer-based HPC code should spend most of its time running these operations. This metric provides a good check to see whether your compiler is correctly vectorizing hotspots. See section H.6 for a list of the instructions considered vectorized.
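
As a rough illustration of what these metrics check, the loop below is one that compilers can usually vectorize, assuming a suitable optimization level (for example -O3) and no pointer aliasing. Time spent in such a loop should be reported under the vector metrics; if it appears as scalar floating-point instead, consult your compiler's vectorization report for that loop. This is a sketch, not code from MAP.

    /* A simple, unit-stride, branch-free loop that is a good vectorization
     * candidate. The restrict qualifiers promise the compiler that x and y
     * do not alias. */
    void axpy(int n, double alpha, const double *restrict x, double *restrict y)
    {
        for (int i = 0; i < n; i++)
            y[i] = alpha * x[i] + y[i];
    }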

CPU branch: The percentage of time each rank spends in test and branch-related instructions such as test, cmp and je. An optimized HPC code should not spend much time in branch-related instructions. Typically the only branch hotspots are during MPI calls, in which the MPI layer is checking whether a message has been fully-received or not.

24.1.2 CPU instruction metrics available on Armv8-A systems

Note

These metrics are not available on virtual machines. Linux perf event counters must be accessible on all systems on which the target program runs.

The CPU instruction metrics available on Armv8-A systems are:

Cycles per instruction: The number of CPU cycles taken to execute each instruction. It is less than 1 when the CPU takes advantage of instruction-level parallelism.

L2 Data cache miss: The percentage of L2 data cache accesses that result in a miss.

Branch mispredicts: The rate of speculatively-executed instructions that do not retire due to incorrect prediction.

Stalled backend cycles: The percentage of cycles where no operation was issued because of the backend, due to a lack of required resources. Data-cache misses can be responsible for this.

Stalled frontend cycles: The percentage of cycles where no operation was issued because of the frontend, due to fetch starvation. Instruction-cache and i-TLB misses can be responsible for this.

24.1.3 CPU instruction metrics available on IBM Power 8 systems

Note

These metrics are not available on virtual machines. Linux perf event counters must be accessible on all systems on which the target program runs.

The CPU instruction metrics available on IBM Power 8 systems are:

Cycles per instruction: The number of CPU cycles to execute an instruction when the thread is not idle. It is less than 1 when the CPU takes advantage of instruction-level parallelism.

CPU FLOPS lower bound: The rate at which floating-point operations completed.

Note

This is a lower bound because the counted value does not account for the length of vector operations.

CPU Memory Accesses: The rate at which the processor's data cache is reloaded from local, remote, or distant memory due to demand loads.

CPU FLOPS vector lower bound: The rate at which vector floating-point instructions completed.

Note

This is a lower bound because the counted value does not account for the length of vector operations.

CPU branch mispredictions: The rate of mispredicted branch instructions. This counts the number of incorrectly predicted retired branches that are conditional, unconditional, branch and link, return or eret.

24.1.4 CPU instruction metrics available on IBM Power 9 systems

Note

These metrics are not available on virtual machines. Linux perf event counters must be accessible on all systems on which the target program runs.

The CPU instruction metrics available on IBM Power 9 systems are:

Cycles per instruction: The number of CPU cycles to execute an instruction when the thread is not idle. It is less than 1 when the CPU takes advantage of instruction-level parallelism.

L3 cache miss per instruction: The ratio of completed L3 data cache demand loads to instructions.

Branch mispredicts: The rate of branches that were mispredicted.

Stalled backend cycles: The percentage of cycles where no operation was issued because of the backend, due to a lack of required resources. Data-cache misses can be responsible for this.

24.2 CPU time

These metrics are particularly useful for detecting and diagnosing the impact of other system daemons on your program's run.

CPU time: The percentage of time that each thread of your program was able to spend on a core. Together with Involuntary context switches, this is a key indicator of oversubscription or interference from system daemons. If this graph is consistently less than 100%, check your core count and CPU affinity settings to make sure one or more cores are not being oversubscribed. If there are regular spikes in this graph, show it to your system administrator and ask for their help in diagnosing the issue.

User-mode CPU time: The percentage of time spent executing instructions in user mode. This should be close to 100%. Lower values or spikes indicate times in which the program was waiting for a system call to return.

Kernel-mode CPU time: Complements the above graph and shows the percentage of time spent inside system calls to the kernel. This should be very low for most HPC runs. If it is high, show the graph to your system administrator and ask for their help in diagnosing the issue.

Voluntary context switches: The number of times per second that a thread voluntarily slept, for example while waiting for an I/O call to complete. This is normally very low for an HPC code.

Involuntary context switches: The number of times per second that a thread was interrupted while computing and switched out for another one. This happens if the cores are oversubscribed, or if other system processes and daemons start running and take CPU resources away from your program. If this graph is consistently high, check your core count and CPU affinity settings to make sure one or more cores are not being oversubscribed. If there are regular spikes in this graph, show it to your system administrator and ask for their help in diagnosing the issue.

System load: The number of active (running or runnable) threads as a percentage of the number of physical CPU cores present in the compute node. This value may exceed 100% if you are using hyperthreading, if the cores are oversubscribed, or if other system processes and daemons start running and take CPU resources away from your program. A value consistently less than 100% may indicate your program is not taking full advantage of the CPU resources available on a compute node.
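
If you suspect oversubscription, a small sketch like the one below (assuming a Linux system; this is not part of MAP) prints how many cores each MPI rank is allowed to run on. A count lower than expected, or the same cores shared between ranks, usually goes together with a high Involuntary context switches rate and a CPU time graph below 100%.

    /* Print the number of cores in each rank's CPU affinity mask. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        cpu_set_t mask;
        CPU_ZERO(&mask);
        if (sched_getaffinity(0, sizeof(mask), &mask) == 0)
            printf("rank %d may run on %d core(s)\n", rank, CPU_COUNT(&mask));

        MPI_Finalize();
        return 0;
    }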

24.3 I/O

These metrics show the performance of the I/O subsystem from the application's point of view. Correlating these with the I/O time in the Application Activity chart helps to diagnose I/O bottlenecks.

POSIX I/O read rate: The total I/O read rate of the application. This may be greater than Disk read transfer if data is read from the cache instead of the storage layer.

POSIX I/O write rate: The total I/O write rate of the application. This may be greater than Disk write transfer if data is written to the cache instead of the storage layer.

Disk read transfer: The rate at which the application reads data from disk, in bytes per second. This includes data read from network filesystems (such as NFS), but may not include all local I/O due to page caching.

Disk write transfer: The rate at which the application writes data to disk, in bytes per second. This includes data written to network filesystems.

POSIX read syscall rate: The rate at which the application invokes the read system call. Measured in calls per second, not the amount of data transferred.

POSIX write syscall rate: The rate at which the application invokes the write system call. Measured in calls per second, not the amount of data transferred.
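
The difference between the syscall-rate metrics and the data-rate metrics can be seen in a small hypothetical example: both calls below write the same amount of data, but a tiny chunk size produces a very high POSIX write syscall rate, while a large chunk size transfers the same bytes in only a few calls.

    #include <unistd.h>

    /* Write 'total' bytes in chunks of 'chunk' bytes (error handling elided). */
    void write_in_chunks(int fd, const char *data, size_t total, size_t chunk)
    {
        size_t off = 0;
        while (off < total) {
            size_t n = (total - off < chunk) ? total - off : chunk;
            ssize_t w = write(fd, data + off, n);
            if (w <= 0)
                break;
            off += (size_t)w;
        }
    }

    /* write_in_chunks(fd, buf, total, 64);               high syscall rate  */
    /* write_in_chunks(fd, buf, total, 8 * 1024 * 1024);  few, large writes  */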

Note

Disk transfer and I/O metrics are not available on Cray X-series systems as the necessary Linux kernel support is not enabled.

Note

I/O performed via direct kernel calls is not counted in the I/O time shown in the Application Activity chart.

Note

Even if your application does not perform I/O, a non-zero amount of I/O will be recorded at the start of the profile because of internal I/O performed by MAP.

24.4 Memory

Here the memory usage of your application is shown in both a per-process and per-node view. Performance degrades severely once all the node memory has been allocated and swap is required. Some HPC systems, notably Crays, will terminate a job that tries to use more than the total node memory available.

Memory usage: The memory in use by the processes currently being profiled. Memory that is allocated and never used is generally not shown. Only pages actively swapped into RAM by the OS are displayed. This means that you will often see memory usage ramp up as arrays are initialized. The slopes of these ramps can be interesting in themselves.

Note

This means that if you malloc or ALLOCATE a large amount of memory but do not actually use it, the Memory usage metric will not increase.
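
A minimal sketch of this behavior (an illustration, not a recommended coding pattern): the allocation below does not raise the Memory usage metric until its pages are actually touched.

    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        size_t bytes = (size_t)4 * 1024 * 1024 * 1024;   /* allocate 4 GiB */
        double *a = malloc(bytes);
        if (!a)
            return 1;

        sleep(10);                 /* Memory usage stays roughly flat here  */

        memset(a, 0, bytes / 2);   /* touch half the pages: usage ramps up  */
        sleep(10);

        free(a);
        return 0;
    }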

Node memory usage: The percentage of memory in use by all processes running on the node, including operating system processes and user processes not in the list of selected ranks when specifying a subset of processes to profile. If node memory usage is far below 100% then your code may run more efficiently using fewer processes or a larger problem size. If it is close to or reaches 100% then the combination of your code and other system daemons are exhausting the physical memory of at least one node.

24.5 MPI

A detailed range of metrics offering insight into the performance of the MPI calls in your application. These are all per-process metrics and any imbalance here, as shown by large blocks with sloped means, has serious implications for scalability.

Use these metrics to understand whether the blue areas of the Application Activity chart are problematic or are transferring data in an optimal manner. These are all seen from the application's point of view.

An asynchronous call that receives data in the background and completes within a few milliseconds will have a much higher effective transfer rate than the network bandwidth. Making good use of asynchronous calls is a key tool to improve communication performance.
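
For example, a minimal sketch (not tied to any particular application) of overlapping computation with a non-blocking receive looks like the following. The received bytes are counted when the call completes, which is why the effective transfer rate can appear much higher than the network bandwidth.

    #include <mpi.h>

    void exchange_with_overlap(double *recv_buf, int count, int src,
                               double *work, int work_count)
    {
        MPI_Request req;
        MPI_Irecv(recv_buf, count, MPI_DOUBLE, src, 0, MPI_COMM_WORLD, &req);

        /* Do useful computation while the message arrives in the background. */
        for (int i = 0; i < work_count; i++)
            work[i] = work[i] * 1.0001 + 1.0;

        /* Completes almost immediately if the data has already arrived. */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    }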

In multithreaded applications, MAP only reports MPI metrics for MPI calls from main threads. If an application uses MPI_THREAD_SERIALIZED or MPI_THREAD_MULTIPLE, the Application Activity chart will show MPI activity, but some regions of the MPI metrics may be empty if the MPI calls are from non-main threads.

MPI call duration: This metric tracks the time spent in an MPI call so far. PEs waiting at a barrier (MPI blocking sends, reductions, waits and barriers themselves) will ramp up time until finally they escape. Large areas show lots of wasted time and are prime targets for investigation. The PE with no time spent in calls is likely to be the last one to arrive, so should be the focus for any imbalance reduction.

MPI sent/received: This pair of metrics tracks the number of bytes passed to MPI send/receive functions per second. This is not the same as the speed with which data is transmitted over the network, as that information is not available. This means that an MPI call that receives a large amount of data and completes almost instantly will have an unusually high instantaneous rate.

MPI point-to-point and collective operations: This pair of metrics tracks the number of point-to-point and collective calls per second. A long shallow period followed by a sudden spike is typical of a late sender. Most processes are spending a long time in one MPI call (very low #calls per second) while one computes. When that one reaches the matching MPI call it completes much faster, causing a sudden spike in the graph.

MPI point-to-point and collective bytes: This pair of metrics tracks the number of bytes passed to MPI send and receive functions per second. This is not the same as the speed with which data is transmitted over the network, as that information is not available. This means that an MPI call that receives a large amount of data and completes almost instantly will have an unusually high instantaneous rate.

Note

(for SHMEM users) MAP shows calls to shmem_barrier_all in MPI collectives, MPI calls and MPI call duration. Metrics for other SHMEM functions are not collected.

24.6 Detecting MPI imbalance

The metrics view shows the distribution of each metric's value across all processes against time, so any 'fat' regions indicate an area of imbalance in that metric. Analyzing imbalance in MAP works like this:

  1. Look at the metrics view for any 'fat' regions. These represent imbalance in that metric during that region of time. This tells us (A) that there is an imbalance, and (B) which metrics are affected.
  2. Click and drag on the metrics view to select the 'fat' region, zooming the rest of the controls in to just this period of imbalance.
  3. Now the stacks view and the source code views show which functions and lines of code were being executed during this imbalance. Are the processes executing different lines of code? Are they executing the same one, but with differing efficiencies? This tells us (C) which lines of code and execution paths are part of the imbalance.
  4. Hover the mouse over the fattest areas on the metric graph and watch the minimum and maximum process ranks. This tells us (D) which ranks are most affected by the imbalance.

Now you know (A) whether there is an imbalance and (B) which metrics (CPU, memory, FPU, I/O) it affects. You also know (C) which lines of code and (D) which ranks to look at in more detail.

Often this is more than enough information to understand the immediate cause of the imbalance (for example, late sender, workload imbalance) but for a deeper view you can now switch to DDT and rerun the program with a breakpoint in the affected region of code. Examining the two ranks highlighted as the minimum and maximum by MAP with the full power of an interactive debugger helps get to the root cause of the imbalance behavior.

24.7 Accelerator

If you have Arm Forge Professional, the NVIDIA CUDA accelerator metrics are enabled on x86_64. Please contact Arm Sales at HPCToolsSales@arm.com for information on how to upgrade.

Note

Accelerator metrics are not available when linking to the static MAP sampler library.

GPU temperature: The temperature of the GPU as measured by the on-board sensor.

GPU utilization: Percent of time that the GPU card was in use, that is, one or more kernels were executing on the GPU card. If multiple cards are present in a compute node, this value is the mean across all the cards in that node. This value is adversely affected if CUDA kernel analysis mode is enabled (see section 30.1).

Time in global memory accesses: Percent of time that the global (device) memory was being read or written. If multiple cards are present in a compute node this value is the mean across all the cards in a compute node.

GPU memory usage: The memory allocated from the GPU frame buffer memory as a percentage of the total available GPU frame buffer memory.

24.8 Energy

The energy metrics are only available with Arm Forge Professional. All metrics are measured per node. If you are running your job on more than one node, MAP shows the minimum, mean and maximum power consumption of the nodes.

Note

Energy metrics are not available when linking to the static MAP sampler library.

GPU power usage: The cumulative power consumption of all GPUs on the node, as measured by the NVIDIA on-board sensor. This metric is available if the Accelerator metrics are present.

CPU power usage: The cumulative power consumption of all CPUs on the node, as measured by the Intel on-board sensor (Intel RAPL).

System power usage: The power consumption of the node as measured by the Intel Energy Checker or the Cray metrics.

24.8.1 Requirements

CPU power measurement requires an Intel CPU with RAPL support, for example Sandy Bridge or newer, and the intel_rapl powercap kernel module to be loaded.

Node power monitoring is implemented via one of two methods: the Arm IPMI energy agent which can read IPMI power sensors, or the Cray HSS energy counters.

For more information on how to install the Arm IPMI energy agent, see section I.7 Arm IPMI Energy Agent. The Cray HSS energy counters are known to be available on Cray XK6 and XC30 machines.

Accelerator power measurement requires an NVIDIA GPU that supports power monitoring. This can be checked on the command line with nvidia-smi -q -d power. If the power values are reported as "N/A", power monitoring is not supported.

24.9 Lustre

Lustre metrics are enabled if your compute nodes have one or more Lustre filesystems mounted. Lustre metrics are obtained from a Lustre client process running on each node, so the data is gathered on a per-node basis and is cumulative over all of the processes running on a node, not only the application being profiled. Some data may therefore be reported as read and written even if the application itself does not perform file I/O through Lustre. However, it is assumed that the majority of data read and written through the Lustre client comes from an I/O-intensive application rather than from background processes, and this assumption has been observed to be reasonable: for application profiles with more than a few megabytes of data read or written, almost all of the data reported in Arm MAP is attributed to the application being profiled.

The data that is gathered from the Lustre client process is the read and write rate of data to Lustre, as well as a count of some metadata operations. Lustre does not just store pure data, but associates this data with metadata, which describes where data is stored on the parallel file system and how to access it. This metadata is stored separately from data, and needs to be accessed whenever new files are opened, closed, or files are resized. Metadata operations consume time and add to the latency in accessing the data. Therefore, frequent metadata operations can slow down the performance of I/O to Lustre. Arm MAP reports on the total number of metadata operations, as well as the total number of file opens that are encountered by a Lustre client. With the information provided in Arm MAP you can observe the rate at which data is read and written to Lustre through the Lustre client, as well as be able to identify whether a slow read or write rate can be correlated to a high rate of expensive metadata operations.

Notes:

  • For jobs run on multiple nodes, the reported values are the mean across the nodes.
  • If you have more than one Lustre filesystem mounted on the compute nodes the values are summed across all Lustre filesystems.
  • Metadata metrics are only available with Arm Forge Professional.

Lustre read transfer: The number of bytes read per second from Lustre.

Lustre write transfer: The number of bytes written per second to Lustre.

Lustre file opens: The number of file open operations per second on a Lustre filesystem.

Lustre metadata operations: The number of metadata operations per second on a Lustre filesystem. Metadata operations include file open, close and create as well as operations such as readdir, rename, and unlink.

Note

Depending on the circumstances and implementation, 'file open' may count as multiple operations, for example, when it creates a new file or truncates an existing one.

24.10 Zooming

To examine a small time range in more detail, you can horizontally zoom in the metric graphs by selecting the time range you wish to see and then left-clicking inside that selected region.

All the metric graphs will then resize to display that selection in greater detail. This only affects the metric graphs; the graphs in all the other views, such as the code editor, will already have redrawn to display only the selected region when that selection was made.

A right-click on the metric graph zooms the metric graphs out again.

This horizontal zoom is limited by the number of samples that were taken and stored in the MAP file. The more you zoom in the more 'blocky' the graph becomes.

While you can increase the resolution by instructing MAP to store more samples (see ALLINEA_SAMPLER_NUM_SAMPLES and ALLINEA_SAMPLER_INTERVAL in 16.11 MAP environment variables) this is not recommended as it may significantly impact performance of both the program being profiled and of MAP when displaying the resulting .map file.

You can also zoom in vertically to better see fine-grained variations in a specific metric's values. The auto-zoom button beneath the metric graphs will cause the graphs to automatically zoom in vertically to fit the data shown in the currently selected time range. As you select new time ranges the graphs automatically zoom again so that you see only the relevant data.

If the automatic zoom is insufficient you can take manual control of the vertical zoom applied to each individual metric graph. Holding down the CTRL key (or the CMD key on Mac OS X ), while either dragging on a metric graph or using the mouse-wheel while hovering over one, will zoom that graph vertically in or out, centered on the current position of the mouse.

A vertically-zoomed metric graph can be panned up or down by either holding down the SHIFT key while dragging on a metric graph or just using the mouse-wheel while hovering over it. Manually adjusting either the pan or zoom will disable auto-zoom mode for that graph; click the auto-zoom button again to reapply it.

The following list summarizes the selection and zoom actions:

  • Select: Drag a range in a metric graph. Selects a time range to examine. Many components (but not the metric graphs) will rescale to display data for this time range only.
  • Reset: Click the Reset icon (under the metric graphs). Selects the entire time range. All components (including the metric graphs) rescale to display the entire set of data, and all metric graphs are zoomed out.
  • Horizontal zoom in: Left-click a selection in a metric graph. Zooms in (horizontally) on the selected time range.
  • Horizontal zoom out: Right-click a metric graph. Undoes the last horizontal zoom in action.
  • Vertical zoom in/out: Ctrl + mouse scroll wheel, or Ctrl + drag on a metric graph. Zooms a single metric graph in or out.
  • Vertical pan: Mouse scroll wheel, or Shift + drag on a metric graph. Pans a single metric graph up or down.
  • Automatic vertical zoom: Toggle the Automatic Vertical Zoom icon (under the metric graphs). Automatically changes the zoom of each metric graph to best fit the range of values it contains in the selected time range. Manually panning or zooming a graph disables auto vertical zoom for that graph only.

24.11 Viewing totals across processes and nodes

The metric graphs show the statistical distribution of the metric across ranks or compute nodes (depending on the metric). So, for example, the Nodes power usage metric graph shows the statistical distribution of power usage of the compute nodes.

If you hover the mouse over the name of a metric on the left-hand side of the graph, MAP displays a tool tip with additional summary information. The tool tip shows the Minimum, Maximum, and Mean of the metric across time and ranks or nodes.

For metrics which are not percentages the tool tip will also show the peak sum across ranks / nodes. For example, the Maximum (∑ all nodes) line in the tool tip for Nodes power usage shows the peak power usage summed across all compute nodes. This does not include power used by other components, for example, network switches.

For some metrics which are rates (for example, Lustre read transfer) MAP will also show the cumulative total across all ranks / nodes in the tool tip, for example, Lustre bytes read (∑ all nodes).

24.12 Custom metrics

Custom metrics can be written to collect and expose additional data (for example, PAPI counters) in the metrics view.

User custom metrics should be installed under the appropriate path in your home directory, for example, /home/your_user/.allinea/map/metrics. Custom metrics can also be installed for all users by placing them in the MAP installation directory, for example, /arm_installation_directory/map/metrics. If a metric is installed in both locations, the user installation will take priority.

Detailed information on how to write custom metrics can be found in supplementary documentation bundled with the Arm Forge installation in allinea-metric-plugin-interface.pdf.
