Understanding Arm Bare Metal Benchmark Application Startup Time
Using Initialization Software in CPAKs
The following screenshot shows a bare metal program running in a simulation of an Arm Cortex-A15 processor system. CPAKs come with example systems and initialization software to help shorten the ramp-up time for users. This article explains how to use the initialization code within CPAKs to understand the program and the time it takes to start up.
The program uses the following code at the top of main.c:
- There is an array named memspace with a #define to set the size of the array. When running new software, it’s a good idea to get through the program once as quickly as possible to learn how it behaves. One way to do this is to cut down the number of iterations, the data size, or anything else that shortens the run, and so gain confidence that it’s running correctly. It can also be useful to place a few breakpoints in the software to become familiar with how it runs.
- Place a breakpoint at the end of the initial assembly code to make sure nothing has gone wrong with the basic setup.
- Next, place a breakpoint at main() to make sure the program gets started, and then stop at interesting-looking C functions to track progress. For this particular program, shrink the size of the memspace array to 200 bytes for the first pass through the test.
- After understanding the basics of the program, change the array size back to the original value of 200000 bytes. With the larger array, the simulation takes much longer to reach main(): about 8 times as many cycles, as shown in the table below.
Array size | Cycles to reach main()
200 | 4860
200000 | 39174
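The kind of size switch described above can be sketched as follows. This is a minimal illustration, not the program's actual source: the macro name MEMSPACE_SIZE, the element type unsigned char, and the helper memspace_bytes() are assumptions made for the example; only the array name memspace and the two sizes come from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed macro name. 200 is the shortened first-pass size;
 * the original value mentioned in the article is 200000. */
#ifndef MEMSPACE_SIZE
#define MEMSPACE_SIZE 200
#endif

/* Element type is an assumption; the article only gives the size in bytes.
 * With no initializer, this array is placed in zero-initialized (bss) data. */
unsigned char memspace[MEMSPACE_SIZE];

/* Hypothetical helper: reports the array size so it can be checked
 * from a debugger or a log line after the program reaches main(). */
size_t memspace_bytes(void)
{
    assert(sizeof memspace == MEMSPACE_SIZE);
    return sizeof memspace;
}
```

Because the default is wrapped in #ifndef, the full size can be restored from the build command with a preprocessor definition such as -DMEMSPACE_SIZE=200000 instead of editing the source.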
Moving from Assembly Code
- There are two parts to jumping from the initial assembly code to the main() function in C. First, save the address of __main in r12 as shown below.
- Next, jump to __main at the end of the assembly code by using the BX r12 instruction. After the BX instruction the program goes into a section of code provided by the compiler (for which there is no source to debug) and will come out again at main().
The code starting from __main performs the following tasks:
- Copies the execution regions from their load addresses to their execution addresses. This is a memory setup task for the case where the code is not loaded in the location it will run from or if the code is compressed and needs to be decompressed.
- Zeros memory that needs to be cleared based on the C standard that says statically-allocated objects without explicit initializers are initialized to zero.
- Branches to __rt_entry
Once the memory is ready, the code starting from __rt_entry sets up the runtime environment by doing the following tasks:
- Sets up the stack and heap
- Initializes library functions
- Calls main()
- Calls exit() after main() completes
If anything goes wrong between the assembly code and main(), the most common cause is the stack and heap setup; take a look at this first if your program doesn’t make it to main().
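To see why the array size affects the time to reach main(), the zeroing step performed under __main can be pictured as a plain store loop. The sketch below is only a model under stated assumptions: zero_region() is a hypothetical function, not the compiler library's routine, and the real code may use wider or multiple-register stores. What it illustrates is that the number of stores grows in proportion to the size of the region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a zero-initialization loop (not the library code).
 * Clears len_bytes of memory one 32-bit word at a time and returns the
 * number of stores issued, which stands in for simulated work. */
size_t zero_region(uint32_t *base, size_t len_bytes)
{
    size_t stores = 0;

    assert(base != NULL);
    for (size_t i = 0; i < len_bytes / sizeof(uint32_t); i++) {
        base[i] = 0u;   /* one store per word */
        stores++;
    }
    return stores;
}
```

In this model a 200 byte region needs 50 word stores and a 200000 byte region needs 50000, so the work done before main() scales with the amount of zero-initialized data.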
The Linux size command is a good way to confirm that the larger array increases the bss section of the image. Zero-initialized (ZI) data and bss refer to the same segment. With the 200 byte array:
-bash-3.2$ size main.axf
   text    data     bss     dec     hex filename
  71276      16  721432  792724   c1894 main.axf
With the 200000 byte array:
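-bash-3.2$ size main.axf
   text    data     bss     dec     hex filename
  71276      16  921232  992524   f250c main.axf
The bss section grows by 199800 bytes (921232 - 721432), which matches the change in array size from 200 to 200000 bytes.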
Alternatives to Save Simulation Time
One way to avoid executing instructions that write zero to memory that is already zero is to use linker scripts or compiler directives to put the array into a different section of memory that is not automatically initialized to 0.
One solution is to just skip __main altogether and go directly to __rt_entry:
- To skip __main, replace the load of __main into r12 with a load of __rt_entry into r12. Now when the program runs, __main is skipped altogether.
- Here are the new results with __main skipped.
Array size | Cycles to reach main()
200 | 4355
200000 | 4362
As expected, the number of cycles to reach main() is about the same with both array sizes, and much lower than when the large array is zeroed. Although the difference may seem small for this benchmark, the cost grows when a larger and more complex software program is run.
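Skipping __main saves cycles, but it also skips the zeroing that C code may silently rely on. The fragment below is a hypothetical example of such a dependency, with names invented for illustration: the counter has no explicit initializer, so it is only guaranteed to start at zero if the startup code has cleared the zero-initialized data.

```c
#include <assert.h>

/* No explicit initializer: the C standard says this starts at zero.
 * On a bare metal target, that guarantee is delivered by the zeroing
 * step under __main, which the shortcut above bypasses. */
static unsigned int call_count;

/* Hypothetical function that depends on call_count starting at zero.
 * Returns 1 on the first call, 2 on the second, and so on. */
unsigned int next_id(void)
{
    call_count++;
    assert(call_count != 0u);   /* guards against counter wraparound */
    return call_count;
}
```

On a simulator whose memory model starts out as all zeros, such code may still appear to work with __main skipped, which is why the shortcut is best kept to benchmarks where the startup behavior is well understood.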
Another possibility to avoid initializing large global variables is to use a compiler pragma:
- The Arm compiler, armcc, has a section pragma to move the large array into a section which is not automatically initialized to zero. To use it, put the pragma around the array declaration as shown below.
- After putting in the pragma, one more step is needed. The scatter file for the linker must be aware of this new section.
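A minimal sketch of the two steps, assuming the armcc "#pragma arm section" syntax and an UNINIT execution region in the scatter file. The section name non_init, the region name, the addresses, the macro MEMSPACE_SIZE, the element type, and the helper memspace_fill() are all illustrative assumptions rather than the values used in the CPAK. The pragma is guarded with __CC_ARM so the file still compiles with other compilers, where the array simply stays in bss.

```c
#include <assert.h>
#include <stddef.h>

#define MEMSPACE_SIZE 200000   /* assumed macro name */

#if defined(__CC_ARM)
#pragma arm section zidata = "non_init"   /* assumed section name */
#endif
unsigned char memspace[MEMSPACE_SIZE];    /* element type is an assumption */
#if defined(__CC_ARM)
#pragma arm section zidata                /* back to the default ZI section */
#endif

/* Scatter file side (illustrative fragment; region name, address, and
 * object name are assumptions). The UNINIT attribute tells the linker
 * not to zero-initialize the region at startup:
 *
 *   ER_NOINIT 0x80200000 UNINIT
 *   {
 *       main.o (non_init)
 *   }
 */

/* Hypothetical helper: the program must write the array before reading
 * it, because its initial contents are no longer guaranteed to be zero. */
void memspace_fill(unsigned char value)
{
    assert(sizeof memspace == MEMSPACE_SIZE);
    for (size_t i = 0; i < sizeof memspace; i++) {
        memspace[i] = value;
    }
}
```

Only the named array moves out of the zero-initialized data, so every other global in the program keeps its normal C initialization.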
Executing the program with the pragma is a much safer solution, especially when the software is going to write the memory anyway and does not assume the initial values are zero. With the pragma, the number of cycles to reach main() is the same with both sizes of the array.
Array size | Cycles to reach main()
200 | 4411
200000 | 4411
The pragma is a good solution if there are a few large arrays that can be found and instrumented with the pragma.
This article was originally written as a blog by Jason Andrews. Read the original post on Connected Community.