System Performance Analysis and the Arm Performance Monitor Unit (PMU)

This tutorial explains how Cycle Models of Arm CPUs enable system performance analysis by providing access to the Performance Monitor Unit (PMU). This includes an introduction to using the Cycle Model System Analyzer to automatically gather information on Arm PMU events for bare metal and Linux software loads.

Introduction

Cycle Models enable system performance analysis by providing access to the Performance Monitor Unit (PMU). They instrument the PMU registers and record PMU events into the Cycle Model System Analyzer database without any software programming. This non-intrusive PMU event collection contrasts with other common methods of software execution:

  • Arm Fast Models focus on speed and have limited ability to access PMU events
  • Simulating or emulating CPU RTL does not provide automatic instrumentation and event collection
  • Silicon requires software programming to enable and collect events from the PMU

PMU Events

The Arm Cortex-A53 processor PMU implements the PMUv3 architecture and gathers statistics on the processor and memory system. It provides six counters which can count any of the available events. The Cycle Model for the Cortex-A53 instruments the PMU events to gather statistics without any software programming. This means all of the PMU events (not just six) can be captured from a single simulation.

The Cortex-A53 PMU Events can be found in the Technical Reference Manual (TRM) in Chapter 12. Below is a partial list of PMU events just to provide some flavor of the types of events that are collected. The TRM details all of the events the PMU contains.


[Figure: PMU Events]


Profiling can be enabled by right-clicking on a CPU model and selecting the Profiling menu. Any or all of the PMU events can be enabled. Any simulation done with profiling enabled will write the selected PMU events into the System Analyzer database.


[Figure: Profiling]


Bare Metal Software

The automatic instrumentation of PMU events is ideal for bare metal software because it requires no programming and automatically covers the entire timeline of the software test or benchmark. Event collection can also be controlled at any time by stopping the simulator and enabling or disabling profiling.

All of the profiling data from the PMU events, the bus transactions, and the software profiling information end up in the Cycle Model System Analyzer database. The screenshot below shows a section of the System Analyzer GUI loaded with PMU events, bus activity, and software activity.


[Figure: ARM Software Analyzer]


The System Analyzer provides many out-of-the-box calculations of interesting metrics, as well as a complete API that allows plugins to be written to compute additional system-specific or application-specific metrics.
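As an illustration of the kind of metric that can be derived from raw PMU event totals, here is a minimal sketch in C. It does not use the System Analyzer plugin API. The `pmu_totals` structure and the function names are hypothetical and exist only for this example. The event numbers in the comments are the PMUv3 architectural event numbers.

```c
#include <stdint.h>

/* Hypothetical container for raw event totals over one time window. */
typedef struct {
    uint64_t cycles;        /* event 0x11, CPU_CYCLES */
    uint64_t inst_retired;  /* event 0x08, INST_RETIRED */
    uint64_t l1d_access;    /* event 0x04, L1D_CACHE */
    uint64_t l1d_refill;    /* event 0x03, L1D_CACHE_REFILL */
} pmu_totals;

/* Divide two counts, returning 0.0 when the denominator is zero. */
static inline double ratio(uint64_t num, uint64_t den)
{
    return den ? (double)num / (double)den : 0.0;
}

/* Instructions per cycle. */
static inline double ipc(const pmu_totals *t)
{
    return ratio(t->inst_retired, t->cycles);
}

/* Fraction of L1 data cache accesses that caused a refill. */
static inline double l1d_miss_rate(const pmu_totals *t)
{
    return ratio(t->l1d_refill, t->l1d_access);
}

/* L1 data cache refills per thousand instructions. */
static inline double l1d_mpki(const pmu_totals *t)
{
    return 1000.0 * ratio(t->l1d_refill, t->inst_retired);
}
```

Because every PMU event is captured in a single simulation, ratios like these can be computed over the same time window without re-running the workload with a different counter selection.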


Linux Performance Analysis

With Linux, the PMU is often used when running benchmarks to profile how the software executes on a given hardware design. By making use of Swap & Play, Linux can be booted quickly and a benchmark can then be run on a cycle-accurate virtual prototype.

Profiling enables events to be collected in the analyzer database, but the user cannot tell which events apply to each Linux process, or differentiate events from the Linux kernel from those of user space programs. It’s also more difficult to determine when to start and stop event collection for a Linux application. Control can be improved by using the techniques described in “Using Linux Swap & Play with ARM Cortex-A Systems”.


Using PMU Counters from User Space

Because the PMU can be used for Linux benchmarks, one option is to write initialization code that sets up the PMU, enables the counters, runs the test, and collects the PMU events at the end. This works well for those willing to write system control coprocessor instructions.


Enable User Space Access

  1. To begin writing a Linux application that accesses the PMU, enable user mode access. This must be done from within the Linux kernel, so it requires a kernel module that is either loaded dynamically or compiled into the kernel. The module sets bit 0 of the PMUSERENR register to 1. This takes only one instruction, but it must be executed from within the kernel. The main section of code is shown below.

     [Figure: Enable User Space Access]

  2. Building a kernel module requires a source tree for the running kernel. If you are using a CPAK, this source tree is available in the tool or can easily be downloaded using the CPAK scripts.
  3. The module can either be loaded dynamically into a running kernel or added to the static kernel build. When working with CPAKs, it’s often easiest to add it to the kernel. When working with a board where the module can be compiled natively on the machine, it’s easier to load it dynamically using:

    $ sudo insmod enable_pmu.ko

  4. Use the lsmod command to see which modules are loaded and the rmmod command to unload the module when finished.
  5. The exit function of the module sets the user mode enable bit back to 0 to restore the original value.
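The module described in the steps above could look roughly like the following sketch. It is not the original code from the article. It assumes AArch32 execution, where PMUSERENR is the CP15 register c9, c14, 0. Two things are additions for illustration: the use of `on_each_cpu()` so that every core is configured, and a read-modify-write so that other bits of the register are preserved (the article describes a single write instruction). The helper name `pmuserenr_with_en` is hypothetical. The kernel-specific part only builds inside a kernel tree for a 32-bit Arm target.

```c
/* enable_pmu.c: sketch of a module granting user mode access to the PMU. */
#if defined(__KERNEL__) && defined(__arm__)
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/smp.h>
#endif

#define PMUSERENR_EN (1u << 0)  /* bit 0: user mode access enable */

/* Return a PMUSERENR value with the EN bit set or cleared. */
static inline unsigned int pmuserenr_with_en(unsigned int old, int enable)
{
    return enable ? (old | PMUSERENR_EN) : (old & ~PMUSERENR_EN);
}

#if defined(__KERNEL__) && defined(__arm__)
/* Runs on one CPU: update PMUSERENR (CP15 c9, c14, 0). */
static void pmu_user_access(void *enable)
{
    unsigned int v;

    asm volatile("mrc p15, 0, %0, c9, c14, 0" : "=r"(v));
    v = pmuserenr_with_en(v, enable != NULL);
    asm volatile("mcr p15, 0, %0, c9, c14, 0" : : "r"(v));
}

static int __init enable_pmu_init(void)
{
    /* Assumption: apply the setting on every online CPU. */
    on_each_cpu(pmu_user_access, (void *)1UL, 1);
    pr_info("enable_pmu: user mode PMU access enabled\n");
    return 0;
}

static void __exit enable_pmu_exit(void)
{
    /* Set the enable bit back to 0 when the module is unloaded. */
    on_each_cpu(pmu_user_access, NULL, 1);
    pr_info("enable_pmu: user mode PMU access disabled\n");
}

module_init(enable_pmu_init);
module_exit(enable_pmu_exit);
MODULE_LICENSE("GPL");
#endif
```

When the kernel runs in AArch64 state, the equivalent register is PMUSERENR_EL0 and is written with an `msr` instruction instead of `mcr`.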

PMU Application

  1. Once user mode access to the PMU has been granted, benchmark programs can take advantage of the PMU to count events such as cycles and instructions. One possible flow from a user space program is:
    • Reset count values
    • Select which of the six PMU counter registers to use
    • Set the event to be counted, such as instructions executed
    • Enable the counters to start counting
  2. Once this is done, the benchmark application can read the current values, run the code of interest, and then read the values again to determine how many events occurred during the code of interest.

[Figure: PMU Application]


The cycle counter is distinct from the six event count registers and is read from a separate CP15 system control register. In this example, event 0x8 (instruction architecturally executed) is monitored using event count register 0.
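The flow described above (reset, select a counter, set the event, enable, then read before and after the code of interest) can be sketched as follows. This is not the original code from the article. It assumes AArch32 CP15 encodings for the PMU registers and that user mode access has already been enabled. All function names are hypothetical. On hosts that are not 32-bit Arm, a small software model stands in for the hardware so the flow can be exercised off target; `model_run` exists only in that fallback.

```c
#include <stdint.h>

#define PMCR_E           (1u << 0)   /* PMCR: enable all counters */
#define PMCR_P           (1u << 1)   /* PMCR: reset event counters */
#define PMCR_C           (1u << 2)   /* PMCR: reset cycle counter */
#define PMCNTEN_CYCLE    (1u << 31)  /* PMCNTENSET: cycle counter enable */
#define EVT_INST_RETIRED 0x08u       /* instruction architecturally executed */
#define NUM_COUNTERS     6u          /* event counters on the Cortex-A53 */

#if defined(__arm__)
/* AArch32 target: PMU registers are reached through CP15. */
static inline void wr_pmcr(uint32_t v)       { __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(v)); }
static inline void wr_pmcntenset(uint32_t v) { __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(v)); }
static inline void wr_pmselr(uint32_t v)     { __asm__ volatile("mcr p15, 0, %0, c9, c12, 5\n\tisb" : : "r"(v)); }
static inline void wr_pmxevtyper(uint32_t v) { __asm__ volatile("mcr p15, 0, %0, c9, c13, 1" : : "r"(v)); }
static inline uint32_t rd_pmxevcntr(void)    { uint32_t v; __asm__ volatile("mrc p15, 0, %0, c9, c13, 2" : "=r"(v)); return v; }
static inline uint32_t rd_pmccntr(void)      { uint32_t v; __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(v)); return v; }
#else
/* Host fallback (assumption, for off-target testing only): software model. */
static struct {
    uint32_t pmcr, cnten, selr, ccntr;
    uint32_t evtyper[NUM_COUNTERS], evcntr[NUM_COUNTERS];
} model;

static inline void wr_pmcr(uint32_t v)
{
    if (v & PMCR_P)
        for (uint32_t i = 0; i < NUM_COUNTERS; i++) model.evcntr[i] = 0;
    if (v & PMCR_C) model.ccntr = 0;
    model.pmcr = v & PMCR_E;
}
static inline void wr_pmcntenset(uint32_t v) { model.cnten |= v; }
static inline void wr_pmselr(uint32_t v)     { model.selr = v % NUM_COUNTERS; }
static inline void wr_pmxevtyper(uint32_t v) { model.evtyper[model.selr] = v; }
static inline uint32_t rd_pmxevcntr(void)    { return model.evcntr[model.selr]; }
static inline uint32_t rd_pmccntr(void)      { return model.ccntr; }

/* Hypothetical helper: pretend `cycles` cycles and `insts` instructions ran. */
static inline void model_run(uint32_t cycles, uint32_t insts)
{
    if (!(model.pmcr & PMCR_E)) return;
    if (model.cnten & PMCNTEN_CYCLE) model.ccntr += cycles;
    for (uint32_t i = 0; i < NUM_COUNTERS; i++)
        if ((model.cnten & (1u << i)) && model.evtyper[i] == EVT_INST_RETIRED)
            model.evcntr[i] += insts;
}
#endif

typedef struct { uint32_t cycles, events; } pmu_sample;

/* Reset counts, select a counter, set its event, and start counting. */
static inline void pmu_start(uint32_t counter, uint32_t event)
{
    wr_pmcr(PMCR_E | PMCR_P | PMCR_C);
    wr_pmselr(counter);
    wr_pmxevtyper(event);
    wr_pmcntenset(PMCNTEN_CYCLE | (1u << counter));
}

/* Read the cycle counter and the selected event counter. */
static inline pmu_sample pmu_read(uint32_t counter)
{
    pmu_sample s;
    wr_pmselr(counter);
    s.events = rd_pmxevcntr();
    s.cycles = rd_pmccntr();
    return s;
}

/* Unsigned subtraction stays correct across a single 32-bit wrap. */
static inline pmu_sample pmu_delta(pmu_sample before, pmu_sample after)
{
    pmu_sample d = { after.cycles - before.cycles, after.events - before.events };
    return d;
}
```

A benchmark would call `pmu_start(0, EVT_INST_RETIRED)` once, then bracket the code of interest with two `pmu_read(0)` calls and pass both samples to `pmu_delta`. The reads themselves add a few cycles and instructions to the measurement, so very short code regions are best measured in a loop.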


Summary

Cycle Models provide full access to all PMU events during a single simulation with no software changes and no limitations on the number of events captured. Additional control can be achieved by writing software to access the PMU directly from a Linux test program or benchmark application.

This article was originally written as a blog by Jason Andrews. Read the original post on Connected Community.