C and C++ both use the stack intensively. For example, the stack is used to hold:
the return address of functions
registers that must be preserved, as determined by the ARM Architecture Procedure Call Standard (AAPCS), for instance, when register contents are saved on entry into subroutines
local variables, including local arrays, structures, unions, and in C++, classes.
Some stack usage is not obvious, such as:
Local integer or floating-point variables are allocated stack memory if they are spilled (that is, not allocated to a register).
Structures are normally allocated to the stack. A space equivalent to sizeof(struct) padded to a multiple of four bytes is reserved on the stack. However, the compiler tries to allocate structures to registers instead where it can.
Arrays are always allocated to the stack. Again, a space equivalent to sizeof(array) padded to a multiple of four bytes is reserved on the stack.
Several optimizations can introduce new temporary variables to hold intermediate results. These optimizations include common subexpression elimination (CSE), live range splitting, and structure splitting. The compiler tries to allocate these temporary variables to registers. If it cannot, it spills them to the stack.
Generally, 16-bit Thumb code makes more use of the stack than ARM code and 32-bit Thumb code, because 16-bit Thumb code has only eight registers available for allocation, compared to fourteen for ARM code and 32-bit Thumb code.
The AAPCS mandates that some function arguments are passed through the stack instead of the registers, depending on their type, size, and order.
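The padding rule for structures and arrays described above can be sketched in C. This is a minimal illustration only: the helper names and the 5-byte structure are hypothetical, and the actual stack frame layout is decided by the compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round a size up to the next multiple of four bytes, following the
   padding rule stated in the text. The function name is hypothetical. */
size_t stack_slot_size(size_t n)
{
    assert(n <= SIZE_MAX - 3u);          /* guard against wrap-around */
    return (n + 3u) & ~(size_t)3u;
}

struct sample { char tag[5]; };          /* hypothetical 5-byte structure */

/* Stack bytes a local `struct sample` would reserve under that rule. */
size_t sample_slot_size(void)
{
    return stack_slot_size(sizeof(struct sample));
}
```

Under this rule a 5-byte structure reserves 8 bytes and a 10-byte character array reserves 12 bytes, so many small locals can cost more stack than their raw sizes suggest.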
Stack use is difficult to estimate because it is code dependent, and can vary between runs depending on the code path that the program takes on execution. However, it is possible to manually estimate the extent of stack utilization using the following methods:
Link with --callgraph to produce a static callgraph. This shows information on all functions, including stack use.
Link with --info=summarystack to list the stack usage of all global symbols.
Use the debugger to set a watchpoint on the last available location in the stack and see if the watchpoint is ever hit.
Running your program under a debug monitor like a Real-Time System Model (RTSM), in DS-5 Debugger or RealView Debugger, has a severe performance penalty, because the watched address is checked for every instruction. Using DSTREAM or RealView ICE and RealView Trace has no such penalty.
Use the debugger, and:
Allocate space in memory for the stack that is much larger than you expect to require.
Fill the stack space with copies of a known value, for example,
Run your application, or a fixed portion of it. Aim to use as much of the stack space as possible in the test run. For example, try to execute the most deeply nested function calls and the worst case path found by the static analysis. Try to generate interrupts where appropriate, so that they are included in the stack trace.
After your application has finished executing, examine the stack space of memory to see how many of the known values have been overwritten. The space has garbage in the used part and the known values in the remainder.
Count the number of garbage values and multiply by four, to give their size, in bytes.
The result of the calculation shows how the size of the stack has grown, in bytes.
Use RTSM, and define a region of memory where access is not allowed directly below your stack in memory, with a map file. If the stack overflows into the forbidden region, a data abort occurs, which can be trapped by the debugger.
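The fill-and-inspect method described above can also be sketched in C, treating the stack region as an array of 32-bit words. This is an illustration under stated assumptions, not the toolchain's own mechanism: the function names and the fill value 0xDEADBEEF are arbitrary choices for this sketch, and the stack is assumed to grow downward from the highest address of the region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL 0xDEADBEEFu   /* arbitrary known value for this sketch */

/* Fill a stack region, given as an array of 32-bit words, with the
   known value. On a real target, do this before the region is in use. */
void stack_paint(uint32_t *base, size_t words)
{
    assert(base != NULL);
    for (size_t i = 0; i < words; i++)
        base[i] = STACK_FILL;
}

/* Return the number of bytes that have been overwritten.
   Assumes a descending stack: use starts at base[words - 1] and grows
   toward base[0], so the scan runs from the lowest address until it
   meets the first word that no longer holds the known value. */
size_t stack_used_bytes(const uint32_t *base, size_t words)
{
    assert(base != NULL);
    size_t untouched = 0;
    while (untouched < words && base[untouched] == STACK_FILL)
        untouched++;
    return (words - untouched) * sizeof(uint32_t);
}
```

Scanning from the low end reports the deepest point the stack reached, which avoids undercounting when a used word happens to hold the known value.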
In general, you can lower the stack requirements of your program by:
writing small functions that only require a small number of variables
avoiding the use of large local structures or arrays
avoiding recursion, for example, by using an alternative algorithm
minimizing the number of variables that are in use at any given time at each point in a function
using C block scope to declare variables only where they are needed.
Declaring variables only where they are required lets the compiler overlap the memory needed by distinct scopes, which minimizes use of the stack.
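As a minimal sketch of the block-scope technique, the function below declares two 64-byte buffers in separate blocks. All names here are hypothetical. Because the lifetimes of the two buffers do not overlap, a compiler is free to place them in the same stack memory, although whether it does so depends on the compiler and optimization level.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: sum the bytes of a buffer. */
unsigned checksum(const unsigned char *p, size_t n)
{
    assert(p != NULL);
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

unsigned two_phase(void)
{
    unsigned total = 0;
    {
        unsigned char header[64];        /* live only inside this block */
        memset(header, 1, sizeof header);
        total += checksum(header, sizeof header);
    }
    {
        unsigned char payload[64];       /* may reuse header's stack space */
        memset(payload, 2, sizeof payload);
        total += checksum(payload, sizeof payload);
    }
    return total;
}
```

Declaring both buffers at the top of the function instead would keep both alive for the whole function, which makes it harder for the compiler to share their storage.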
Code performance is optimized by locating the stack in fast (zero wait-state), on-chip, 32-bit RAM. The ARM and Thumb (PUSH and POP) stack access instructions each push or pop a number of 32-bit registers on or off the stack. If the stack is in 32-bit memory, each register access takes one cycle. However, if the stack is in 16-bit memory, each register access takes two cycles, reducing overall performance.
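The cost difference can be made concrete with a small calculation. The sketch below models only the per-register data accesses described above. The function name is hypothetical, and the model ignores instruction fetch, wait states, and any other overhead.

```c
#include <assert.h>

/* Data-access cycles needed to push or pop `registers` 32-bit registers
   when the stack sits in memory that is `bus_bits` wide (16 or 32). */
unsigned stack_transfer_cycles(unsigned registers, unsigned bus_bits)
{
    assert(bus_bits == 16u || bus_bits == 32u);
    unsigned cycles_per_register = 32u / bus_bits;   /* 1 or 2 */
    return registers * cycles_per_register;
}
```

For example, saving eight registers costs 8 access cycles in 32-bit memory and 16 in 16-bit memory under this simplified model.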