Using Linux Swap & Play with Arm Cortex-A Systems

This article explains how the availability of Swap & Play models on Cortex-A processors allows Cycle Model users to run Linux benchmark applications for system performance analysis.

Getting Started with Swap & Play

The following tips can help you get started using Swap & Play models with Linux systems:

  1. Launch benchmark software automatically on boot
  2. Set application breakpoints for Swap & Play checkpoints
  3. Add markers in benchmark software to track progress

With the availability of Swap & Play models for a variety of Arm processor technologies, users can create, validate, and analyze the combination of hardware and software using cycle accurate virtual prototypes running realistic software workloads. Combine this with access to models of candidate IP, and the result is a unique flow which delivers cycle accurate Arm system models to design teams.


Create, Analyze, Validate


What is Swap & Play?

Swap & Play technology enables high-performance simulation (based on Fast Models) to be executed up to user-specified breakpoints, and the state of the simulation to be transferred to and resumed on a cycle accurate virtual prototype.

One of the most common uses of Swap & Play is to run Linux benchmark applications to profile how the software executes on a given hardware design. Linux can be booted quickly and then the benchmark run using the cycle accurate virtual prototype. These tips make it easier to automate the entire process and get to the system performance analysis.


Launch Benchmarks on Boot

The first tip is to automatically launch the benchmark when Linux is booted.

Linux CPAKs on System Exchange use a single executable file (.axf) for each system, with the following artifacts linked into the image:

  • Minimal Boot loader
  • Kernel image
  • Device Tree
  • RAM-based File System with applications
  1. To customize and automate the execution of a desired Linux benchmark application, a Linux device tree entry can be created to select the application to run after boot. The device tree support for “include” can be used to include a .dtsi file containing the kernel command line, which launches the desired Linux application.
  2. Below is the top of the device tree source file from an Arm Cortex-A15 processor CPAK. If one of the benchmarks to be run is the bw_pipe test from the LMbench suite, a .dtsi file is included in the device tree.


    A15 CPAK Device Tree Source File


    The include line pulls in a description of the kernel command line. For example, if the bw_pipe benchmark from LMbench is to be run, the include file contains the kernel arguments shown below:

    Kernel Arguments


  3. The rdinit kernel command line parameter is used to launch a script that automatically executes the desired Linux application. The bw_pipe.sh script can then run the bw_pipe executable with the desired command line arguments; a sketch of such a command line appears after this list.
  4. Scripting, or manually editing the device tree, can be used to modify the include line for each benchmark, and a unique .axf file can be created for each Linux application. The result is an easy-to-use .axf file that launches the benchmark automatically, with no interactive typing required. Having a unique .axf file per benchmark also makes it easy to hand off to other engineers: they don’t need to know anything about how to run the benchmark; they just load the .axf file and the application runs automatically.
  5. It is also recommended to create an .axf image that runs /bin/bash, for use when testing new benchmark applications in the file system. New benchmarks can first be run manually from the shell on the Fast Model to make sure they are working correctly.
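
The device tree source and kernel arguments appeared as images in the original post and are not reproduced here. As a rough sketch of the idea (the bw_pipe.dtsi file name, the console device, and the /bw_pipe.sh path are illustrative assumptions rather than actual CPAK contents), the include line at the top of the .dts file and the .dtsi it pulls in might look like this:

    /* Top of the CPAK .dts file: the include selects the kernel command
       line for the benchmark to be run (sketch only) */
    /dts-v1/;
    /include/ "bw_pipe.dtsi"

    /* bw_pipe.dtsi: supplies the kernel command line through the chosen node */
    / {
            chosen {
                    /* rdinit= points the kernel at the script in the RAM-based
                       file system that launches the benchmark after boot */
                    bootargs = "console=ttyAMA0 root=/dev/ram rw rdinit=/bw_pipe.sh";
            };
    };

The bw_pipe.sh script referenced by rdinit can be a one-line shell script in the file system that simply invokes the bw_pipe executable with the desired command line arguments.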

Setting Application Breakpoints

Once benchmarks are automatically running after boot, the next step is to set application breakpoints to use as Swap & Play checkpoints. Linux uses virtual memory, which can make it difficult to set breakpoints in user space. While there are application-aware debuggers and other techniques to debug applications, most are either difficult to automate or overkill for system performance analysis.

One way to locate breakpoints is to call from the application into the Linux kernel, where it is much easier to place a breakpoint. Any system call that is unused by the benchmark application can serve as a breakpoint location. Preferably, the chosen system call should have no other side effects that would affect the benchmark results.

  1. To illustrate how to do this, consider a benchmark application to be run automatically on boot. Let the first checkpoint be taken when main() begins. Place a call to the sched_yield() function as the first action in main(), and make sure to include the header file sched.h in the C program. This calls into the Linux kernel, so a breakpoint can be placed in the Linux kernel file kernel/sched/core.c at the entry point for the sched_yield system call; sketches of both the user-space and kernel sides appear below.
  2. Here is the bw_pipe benchmark with the added system call.


    Bw-pipe Benchmark and System Call


  3. Put a breakpoint in the Linux kernel at the system call, and when the breakpoint is hit, save the Swap & Play checkpoint. Here is the code in the Linux kernel.

    Linux Kernel


  4. The same technique can be used to identify other locations in the benchmark application, including the end of the benchmark, to stop simulation and gather results for analysis.
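
As a minimal sketch of the user-space side (the bw_pipe source is part of LMbench and is only outlined here), the added call is simply the first statement in main():

    #include <sched.h>      /* declares sched_yield() */

    int main(int argc, char *argv[])
    {
            /* First action in main(): trap into the kernel so that a breakpoint
               on the sched_yield system call marks the start of the benchmark
               and a Swap & Play checkpoint can be saved at this point. */
            sched_yield();

            /* ... the original bw_pipe benchmark code runs here ... */

            /* A second sched_yield() at the end of the benchmark gives another
               easily located point to stop simulation and collect results. */
            sched_yield();
            return 0;
    }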

The sched_yield system call yields the current processor to other threads, but in the controlled environment of a benchmark application it is unlikely to cause any rescheduling at the start or end of a program. If used in the middle of a multi-threaded benchmark, it may affect the scheduler.
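
For reference, the kernel-side entry point for the system call is in kernel/sched/core.c. The exact body varies between kernel versions, but in outline it looks something like the following; the breakpoint goes at the start of this function:

    /* kernel/sched/core.c -- outline only; details differ by kernel version */
    SYSCALL_DEFINE0(sched_yield)
    {
            struct rq *rq = this_rq_lock();

            /* A breakpoint placed here is hit when the benchmark application
               calls sched_yield(); when it triggers, save the Swap & Play
               checkpoint. */

            schedstat_inc(rq, yld_count);
            current->sched_class->yield_task(rq);

            /* ... release the run queue lock, reschedule, and return ... */
            schedule();
            return 0;
    }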


Tracking Benchmarking Progress

While print statements are one way to understand a benchmark's progress and estimate its time to completion, adding too many can distort the performance analysis. Even a simple printf() call to a UART from a C program under Linux involves a fairly complex sequence: the C library, some system calls, UART device driver activations, and 4 or 5 interrupts for an ordinary length string.

  1. Alternatively, to get feedback on the benchmark application's progress, bypass all of the printf() overhead and make a system call directly from the benchmark application, using very short strings that fit in the UART FIFO and can be processed with a single interrupt.
  2. The C program below shows how to do it (a sketch of the same idea also appears after this list).


    C-program


  3. By using short strings of just a few characters, it’s easy to insert markers in the benchmark application to track progress without interfering with the benchmark results. The same technique also lets the user learn what happens when a Linux system call is invoked, by tracing the activity in the kernel from the start of the system call to the UART driver.
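
As a minimal sketch of the idea (the marker strings and the use of file descriptor 1, the console the kernel opens for the init process, are illustrative assumptions), write() issues the system call directly and bypasses the C library's stdio buffering:

    #include <unistd.h>     /* write(): a direct system call, no stdio buffering */

    /* Emit a short progress marker on the console. A marker of only a few
       characters fits in the UART FIFO and is typically handled with a single
       interrupt, so it barely disturbs the benchmark results. */
    static void marker(const char *tag)
    {
            write(1, tag, 3);   /* 3-character markers such as "M1\n" */
    }

    int main(void)
    {
            marker("M1\n");
            /* ... first phase of the benchmark ... */
            marker("M2\n");
            /* ... second phase of the benchmark ... */
            marker("M3\n");
            return 0;
    }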


This article was originally written as a blog by Jason Andrews. Read the original post on Connected Community.