Using DS-5 and Streamline with Linux on Armv8-based platforms
Streamline is a performance analysis tool that can be used to guide code optimisation. It analyses an executable by sampling the target's counters while the code is executing. This tutorial gives an overview of using Streamline with an Armv8 target.
Configuring the tools
To compile the example binary for an Armv8 target, you will need to install the
aarch64-linux-gnu toolchain. For this tutorial, the binaries were built with the
gcc-linaro-aarch64-linux-gnu-4.9-2014.09_win32 toolchain. This toolchain was not available as part of the DS-5 5.22 installation used to write this tutorial, but it is required to compile binaries for the 64-bit environment provided by an Armv8 target. You will therefore need to add the toolchain to DS-5 yourself.
- You can download the toolchain here. You will need to download the correct
gcc-linaro-aarch64-linux-gnu-4.9-2014.09 package for your host operating system.
- After downloading, you can add the toolchain to DS-5 by following these instructions.
Target description and Linux distribution
This tutorial was completed on a Juno board running a LAMP OE filesystem image (
lt-vexpress64-openembedded_lamp-armv8-gcc-4.9_20150620-722.img) on an OpenEmbedded UEFI bootloader (
- To run this tutorial on a Juno board, follow these instructions to download and setup the board's bootloader and filesystem.
To use the Streamline analysis tool, a gatord daemon and a gator driver are required on the target device. They collect the performance samples on the target device and send the results back to the host. Gator communicates with Streamline over TCP/IP. For more information, have a look at the Gator documentation included in your DS-5 installation (
<DS-5 installation directory>/arm/gator/protocol).
This Linux distribution has gatord preinstalled. If you are using another image, you must make sure gatord is available, or install it on your Linux target. For more information on how this can be done, take a look at the 'Building the gator daemon' section of the Arm DS-5 Streamline User Guide.
Connecting to the hardware target
Creating a serial connection
To create a serial connection to a hardware target, you can use the serial terminal built into DS-5. For the Juno board, the serial cable should be connected to the UART0 port.
- To view a Terminal window, click Window > Show View > Other... then in the Terminal folder, select Terminal.
- Click the Connect button and select Serial in the Connection Type drop down box.
- Select the correct COM port from the available ports listed.
- The Juno board requires a serial connection with a baud rate of 115200, 8 data bits, 1 stop bit and no parity bit.
- Click OK.
This serial connection will allow you to access the command line on your target in order to find its IP address using the command
Creating a Remote Systems connection
To connect to the target device, you must first configure a Remote Systems connection.
- To view the Remote Systems window, click Window > Show View > Other... then in the Remote Systems folder, select Remote Systems.
- Click the Define a connection to a remote system button to create a new connection.
- In the Select Remote System Type window, select Linux and click Next.
- In the Host name box type in the IP address of your target.
- In the Connection name box, give your connection a name, for example Juno, then click Next.
- In the Configuration box, select ssh.files and then click Finish (if you click Next by mistake, leave all the options as their defaults before selecting Finish).
- When prompted, enter the target's username and password in the dialog box. The example Linux distribution used in this tutorial does not require a root password.
Building the example binary
For this tutorial we are using the threads project included in the
<DS-5 installation directory>/examples/Linux_examples folder in the DS-5 release. Instructions on how to import the example projects can be found here.
The project contains a C file containing the code for the threads, prebuilt binaries and debug configurations for both Fixed Virtual Platform (FVP) models and hardware targets, as well as a Makefile used to automate the compiling process.
Before editing the project, it is a good idea to remove the prebuilt binaries that are included in the project. These binaries are compiled for the Armv7 architecture and are therefore incompatible with the Armv8 architecture we are targeting.
- To remove the files, right-click on the project in the Project Explorer window and select Clean Project.
The toolchain associated with the project is
arm-linux-gnueabihf which is used to compile Linux applications for Armv7 architectures.
- To change the associated toolchain to
aarch64-linux-gnu, right-click on the project file in the Project Explorer window and select Properties.
- Under C/C++ Build, select Tool Chain Editor. In the Current toolchain drop-down, select your newly imported toolchain and then click Apply to save the changes and click OK.
To compile for an Armv8 target, you must change the included Makefile.
- Find the line
ABI = -marm -mfloat-abi=hard and replace it with
ABI = -march=armv8-a. This flag compiles the code to specifically target the Armv8 architecture. Armv8 compilers no longer need a flag for hardware floating point because, by default, all floating point linkage is done in hardware.
- All mentions of
arm-linux-gnueabihf must be changed to the prefix of the Armv8 toolchain you added (aarch64-linux-gnu in this tutorial).
- Save the changes you have made to the Makefile.
- To build the project, right click on the project in Project Explorer and select Build Project. This invokes the make all command and creates the Armv8-compatible stripped and unstripped binaries.
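If you would rather script the Makefile changes than edit the file by hand, the steps above can be sketched as a small shell function. This is only a sketch under stated assumptions: it assumes the ABI line appears exactly as quoted in this tutorial, and that every arm-linux-gnueabihf prefix should become aarch64-linux-gnu, the toolchain named earlier. The name retarget_makefile is a hypothetical helper, not part of DS-5.

```shell
#!/bin/sh
# Hypothetical helper: retarget the example Makefile from Armv7 to Armv8.
# Argument: $1 = path to the Makefile to modify in place.
retarget_makefile() {
    makefile="$1"
    # 1. Swap the Armv7 ABI flags for the Armv8 architecture flag.
    # 2. Replace every mention of the Armv7 toolchain prefix
    #    (assumed replacement: aarch64-linux-gnu).
    # Write to a temporary file first so that a failed sed run
    # does not leave a truncated Makefile behind.
    sed -e 's/^ABI = -marm -mfloat-abi=hard[[:space:]]*$/ABI = -march=armv8-a/' \
        -e 's/arm-linux-gnueabihf/aarch64-linux-gnu/g' \
        "$makefile" > "$makefile.tmp" && mv "$makefile.tmp" "$makefile"
}

# Example: retarget_makefile path/to/threads/Makefile
```

Keep a copy of the original Makefile, or rely on version control, so that you can compare the result with the manual edits before building.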
Loading the image onto the target
There are two different ways to load the image onto the target, either using a debug configuration in DS-5 or manually using the Remote Systems Explorer (RSE) connection you have created.
Loading the image using a debug configuration
To load the image onto a hardware target, you must edit the debug configuration provided with the project.
- Select Run > Debug Configurations..., then choose the imported debug configurations under DS-5 Debugger in the left-hand panel. The debug configuration included in the project for connecting to a hardware target is the
threads-gdbserver-example debug configuration. You will need to change its connection type to allow you to connect to a 64-bit gdbserver.
- In the Connection tab, in the Select target panel, select Linux Application Debug > Application Debug > Connections via AArch64 gdbserver > Download and debug application.
- In the Connections panel, select your hardware Remote Systems Explorer (RSE) connection from the drop down box.
- Click Apply to save these changes, then click Debug to start the debug configuration and load the file onto the target.
- When the dialog box asking you to open the DS-5 debug perspective appears, click Yes to switch to the debugger perspective.
Loading the image onto the target manually
Alternatively, you can load the program binary onto the target without using a debug configuration.
- Drag and drop the executable from your project in the Project Explorer window into your home directory on your target in the Remote Systems window.
- Use the command
chmod +x threads in the target's home directory command line to change the file permissions and allow the program to be executed.
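The manual steps above can also be carried out from a host command line. The following is a minimal sketch, assuming that scp and ssh are installed on the host and that the target accepts SSH logins (the Remote Systems connection in this tutorial uses ssh.files). The function name deploy_binary and the SCP and SSH override variables are hypothetical, introduced here only so that the commands can be previewed as a dry run.

```shell
#!/bin/sh
# Hypothetical helper: copy a binary to the target's home directory
# and mark it executable there.
# Arguments: $1 = local path to the binary, $2 = user@host of the target.
# SCP and SSH may be overridden (for example with "echo") for a dry run.
deploy_binary() {
    binary="$1"
    target="$2"
    # "user@host:" with nothing after the colon means the remote home directory.
    ${SCP:-scp} "$binary" "$target:" &&
        ${SSH:-ssh} "$target" "chmod +x $(basename "$binary")"
}

# Example (replace the placeholders with your own values):
#   deploy_binary threads <user>@<target IP address>
```

Setting SCP=echo and SSH=echo before calling the function prints the two commands instead of running them, which is a convenient way to check the arguments first.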
In this tutorial, we will be using Streamline through Eclipse for DS-5. Streamline can also be used purely through the command line. For more information on how this can be done, take a look at the 'Using Streamline in the command line' section of the Arm DS-5 Streamline User Guide.
Setting up a Streamline capture
- If you cannot see the Streamline window, select Window > Show View > Streamline Data.
- To configure your Streamline capture, click the Capture analysis options button.
- In the Connections panel enter the IP address of your target.
- In the Program Images panel, select Add ELF image from Workspace... to include the unstripped executable in your capture. Make sure that the Toggle Symbol Loading is on (the eye appears next to the image file), so that Streamline will read the high-level debug symbols from the executable.
- Click Save to save the options.
If you want to change the events that Streamline will collect, click the Counter Configuration button and choose the events you would like to observe. For this tutorial, we will leave the default event settings.
- To start the capture, click the Start Capture button, give the Streamline session a name, for example Threads_v8, and choose a location to save it.
- Click Save to start the capture.
- Once the capture has started you can run the threads executable using the connected debug configuration, by clicking on the Continue arrow in the Debug Control window or by pressing F8. If you transferred the executable manually, you can run the program from the target’s command line by entering
./threads in the directory where the executable is saved.
It is possible that gatord will not have automatically started on the target. To start gatord manually, enter
/usr/sbin/gatord & into the target’s command line.
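To avoid starting a second copy of gatord when one is already running, you can check for the process first. The sketch below is one way to do this on a Linux target. It reads /proc directly, on the assumption that tools such as ps or pgrep might not be present on a minimal image. The function names is_running and start_if_missing are hypothetical. The gatord path is the one given above.

```shell
#!/bin/sh
# Return success if a process with the given name is currently running.
# Argument: $1 = process name as it appears in /proc/<pid>/comm.
is_running() {
    for comm_file in /proc/[0-9]*/comm; do
        # A process may exit while we scan, so ignore read errors.
        if [ "$(cat "$comm_file" 2>/dev/null)" = "$1" ]; then
            return 0
        fi
    done
    return 1
}

# Start a daemon in the background unless it is already running.
# Arguments: $1 = process name, $2 = command or path used to start it.
start_if_missing() {
    if is_running "$1"; then
        echo "$1 is already running"
    else
        "$2" &
        echo "started $1"
    fi
}

# On the target:
#   start_if_missing gatord /usr/sbin/gatord
```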
Recording the Streamline capture
When the capture is being recorded, you can watch the data being collected in the Live view. You might notice the red square with the word USR in it. This indicates that Streamline is running in User Space Mode. Streamline can also be run in Kernel Space Mode, which allows you to collect a wider range of data from the target, but for this tutorial, User Space Mode is sufficient. For more information about the differences between user space gator and kernel space gator, take a look at the 'Comparison of user space gator and kernel space gator' in the Arm DS-5 Streamline User Guide.
- When you want to stop the capture, you can use the Stop capture from target and analyse collected data button, which will stop recording and save the data.
- After ending the capture, you can disconnect from any debug configurations you were connected to by double clicking on the debug configuration in the Debug Control window.
Understanding the Streamline interface
The Streamline interface offers five views of the collected data.
The Timeline view shows the events that were selected for collection, which you can use to observe the different performance metrics of the program. The Heat Map view shows which processes were active at a particular time and what percentage of the total instructions they are responsible for. The Core Map and Cluster Map views show which core or cluster of cores was responsible for each process at a particular time. The Processes view shows the percentage of the CPU activity that each process is responsible for at the position of the Cross-section marker.
In the Call Paths view, the functions and threads associated with each process are shown and their information can be viewed, as well as a breakdown of the samples collected.
In the Functions view, statistics for specific functions within the source code can be viewed. You can see, among other things, how many samples were taken in each function, how many times a function executed, its size and location within the source code. By right-clicking on a function, you can see it displayed in other tabs.
In the Code view, you are able to view the source code and assembly instructions of the executable and observe the number of samples taken on each line, which shows which lines of code spent the longest time running.
- If you receive a The source file is missing error message, the code hasn’t been found automatically. Click the Click here to locate the file link and locate it manually.
The Log view allows you to see any ANNOTATE statements included in your code. This particular project does not contain any annotations, so this view will remain empty.
More information on the different Streamline views can be found in the Arm DS-5 Streamline User Guide.
Making sense of the Streamline report
In the threads example project, the Timeline view shows two peaks of activity: at the beginning of the function, when creating the threads, and at the end, when the individual threads re-join the parent thread, print the results and the program exits. Between these two periods of high activity, the threads are 'working' - executing the
thread_work function, which calls the
accumulate function. This function gets the individual threads to complete a number of loops, every time incrementing the floating point starting value,
thread_app_data[t].result, by another floating point,
thread_app_data[t].step. The Call Paths and Functions views show that the target spends most of its time idle, running code from the kernel.
The processes list in the Timeline view can be used to see which cores are being used by which threads. As the default number of threads for this program is 5, a multi-core system such as the Juno can assign each thread to a different core. However, the same core does not necessarily execute the same thread for the duration of the program, so some of the threads are handled by a different core later in the program. The activity of the cores can be seen by clicking on the arrow in the CPU Activity label, once to see the cluster activity and twice to see the activity of the individual cores.
Unsurprisingly, the line of code that takes the longest time is threads.c:118,
accum = accum + step; which adds the floating point numbers together. This can be seen in the Code view. Specifically, the load instruction at
0x00400d80 shown in the disassembly view, which retrieves the
accum variable from memory, is the instruction that takes the longest time. Load instructions can take a long time because of the number of steps needed to retrieve data from memory. Sometimes up to five steps are required (instruction fetch, register access, passing through the ALU, memory access and register write-back), but if the data is already in the cache, it can be loaded in fewer steps. The accumulate function is responsible for most of the execution time in the threads executable because the loop executes more than a million times for each thread.
It is not always the case that the instruction with the most samples is the instruction that executes for the longest time. Sometimes a program is stalled on a particular instruction, waiting for another instruction to finish. This leads to a large number of samples being taken on the instruction the program is stalled on, rather than on the instruction with the longer execution time that it is waiting for.
This tutorial has provided an overview of how to use Streamline with an Armv8 target and the functions it provides.
Threads is a relatively simple program, but the same principles apply when using Streamline for a more complex example project.
To continue learning about the uses of Streamline, try the 'Using Streamline to Guide Cache Optimization' tutorial on the Arm Connected Community.
Another feature you may want to explore is Streamline's annotation feature. More information can be found in the 'Streamline Annotate' section of the Streamline User Guide or by trying the
Streamline_annotate example project in the Linux_examples folder built into the DS-5 release. The Linux_examples directory also contains other Linux example projects that can be analysed using Streamline.