The Data processing and flow control exercise concentrated on data processing and flow control instructions. In this exercise, we show how to access memory with load and store instructions. To do this, we implement our own simple memory copy function, my_memcpy().
As with Data processing and flow control, a framework project is provided to get you started. Follow these steps:
- Import the 2_memcpy project into the Arm Development Studio IDE.
The imported project then appears in the Project Explorer pane, as you can see in the following screenshot:
You should see the following files in the memcpy project:
This is a simple reset handler. You will not need to modify this file for this exercise. Unlike Data processing and flow control, this startup file includes code to configure and enable the MMU.
This contains the C main() function, and implements a simple test harness for the function that you will develop.
This is an A64 assembler file. This file contains an empty function definition that you will complete.
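The provided main() is not reproduced in this guide, so the following is only a minimal sketch of what a test harness for a copy routine could look like. The names check_copy, copy_fn, HARNESS_BUF_SIZE, and stand_in_copy are hypothetical and do not come from the project. The stand-in routine uses the C library memcpy() so that the sketch can be built without the assembler file.

```c
#include <stdint.h>
#include <string.h>

#define HARNESS_BUF_SIZE 64u            /* hypothetical buffer size */

/* Same argument order as the my_memcpy prototype: src, dst, size. */
typedef void (*copy_fn)(uint8_t *src, uint8_t *dst, uint32_t size_in_bytes);

/* Returns 1 if fn copied exactly size bytes and nothing more, else 0. */
int check_copy(copy_fn fn, uint32_t size)
{
    uint8_t src[HARNESS_BUF_SIZE];
    uint8_t dst[HARNESS_BUF_SIZE + 1];  /* one extra guard byte */
    uint32_t i;

    if (size > HARNESS_BUF_SIZE) {
        return 0;                       /* request does not fit the buffers */
    }

    for (i = 0; i < HARNESS_BUF_SIZE; i++) {
        src[i] = (uint8_t)(i + 1);      /* known pattern: 1, 2, 3, ... */
    }
    memset(dst, 0xAA, sizeof dst);      /* filler that never occurs in src */

    fn(src, dst, size);

    if (memcmp(src, dst, size) != 0) {
        return 0;                       /* copied data does not match */
    }
    return dst[size] == 0xAA;           /* byte after the copy is untouched */
}

/* Stand-in for the routine under test (assumption: not the project code). */
void stand_in_copy(uint8_t *src, uint8_t *dst, uint32_t size_in_bytes)
{
    memcpy(dst, src, size_in_bytes);
}
```

Checking the byte just past the copied region catches a routine that writes too much, which is an easy mistake to make when the copy size is rounded up to a larger access size.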
Implement byte by byte copying
The following code shows the empty function that we are going to implement:
        .global my_memcpy
        // void my_memcpy(uint8_t* src, uint8_t* dst, uint32_t size_in_bytes)
        .type my_memcpy, @function
    my_memcpy:
        //
        //
        // ADD YOUR CODE HERE
        //
        //
        RET
The function takes three arguments:
- src: a pointer to the source buffer, which points to the first data item
- dst: a pointer to the destination buffer, which points to the first empty location
- size_in_bytes: the number of bytes to be copied
For this exercise, we can assume that the pointers are to memory that is marked as Normal and that strict alignment checking is not enabled. This means that unaligned accesses are permitted.
There are several possible approaches to implementing the function. We start with the simplest approach, which is a byte by byte copy. In pseudocode, we can represent this as follows:
    while size_in_bytes greater than 0
        load byte from src
        increment src pointer by 1
        store byte to dst
        increment dst pointer by 1
        decrement size_in_bytes by 1
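For readers who find it easier to reason from C first, the pseudocode can be modeled roughly as shown below. This is only a sketch of the loop structure, not the A64 solution that the exercise asks for. The name byte_copy_model is hypothetical; the parameters follow the prototype in the comment of the assembler file.

```c
#include <stdint.h>

/* C model of the byte by byte pseudocode (hypothetical helper name). */
void byte_copy_model(uint8_t *src, uint8_t *dst, uint32_t size_in_bytes)
{
    while (size_in_bytes > 0) {
        uint8_t byte = *src;    /* load byte from src */
        src++;                  /* increment src pointer by 1 */
        *dst = byte;            /* store byte to dst */
        dst++;                  /* increment dst pointer by 1 */
        size_in_bytes--;        /* decrement size_in_bytes by 1 */
    }
}
```

Each C statement corresponds to one line of the pseudocode, which makes it a convenient reference when deciding which operations the assembler version can combine.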
Here are a few things to consider before getting started:
- What size are addresses in AArch64?
- How will you update the pointers after each iteration?
- What is the syntax for loading a sub-register sized quantity?
Run the completed image
Once you have completed the function, you can test it using the Fixed Virtual Platform (FVP) models that are provided with Arm Development Studio.
As in Data processing and flow control, the Console tab shows the build messages. If the project builds successfully, the output will look like what you can see in this screenshot:
When you have successfully built your image, test it using the FVP models. This exercise uses an FVP with a single-core Cortex-A53 processor.
The Target Console tab shows the output from the simulator. The output for a successful run looks like this:
    terminal_0: Listening for serial connection on port 5000
    terminal_1: Listening for serial connection on port 5001
    terminal_2: Listening for serial connection on port 5002
    terminal_3: Listening for serial connection on port 5003
    CADI server started listening to port 7000
    Info: FVP_Base_Cortex_A53x1: CADI Debug Server started for ARM Models...
    CADI server is reported on port 7000
    Memcpy Workbook: Finished successfully
Every time you launch the model, you will see a window open, like the one that is shown here:
This window represents the LCD and switches of the simulated platform. These exercises do not use these features, but there is something else of interest. Total Instr reports the number of instructions that the simulator has executed since it was launched.
Note: Your figure might be different to that shown in the screenshot. The total instruction count 7,453 is based on the reference solution with Arm Compiler 6.12.
Implement multi-byte copying
Copying one byte at a time is simple, but inefficient. For most copy operations, we want to transfer more than one byte at a time, so that we can reduce the number of iterations. We might also try to issue multiple loads and stores for each iteration.
The next step is to modify my_memcpy() to use load and store pair instructions with X registers. This means that 128 bits, not 8 bits, are copied per iteration. The code also needs to be able to handle data whose size is not a multiple of 128 bits. Follow these steps:
- Modify the my_memcpy() function to use the LDP and STP instructions with X registers for the first iterations. Use smaller accesses for the last few bytes of the data.
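The structure of this approach can be sketched in C as a main loop that moves 16 bytes at a time, followed by a tail loop for whatever is left. This is a model under assumptions, not the reference solution: the name pair_copy_model is hypothetical, the tail is copied one byte at a time for simplicity, and the 8-byte loads and stores go through the C library memcpy() so that the C code makes no alignment assumptions of its own.

```c
#include <stdint.h>
#include <string.h>

/* C model of a 16-bytes-per-iteration copy (hypothetical helper name). */
void pair_copy_model(uint8_t *src, uint8_t *dst, uint32_t size_in_bytes)
{
    /* Main loop: two 64-bit values per iteration, which mirrors one
       load pair and one store pair of X registers. */
    while (size_in_bytes >= 16) {
        uint64_t first;
        uint64_t second;

        memcpy(&first, src, 8);         /* load bytes 0 to 7 */
        memcpy(&second, src + 8, 8);    /* load bytes 8 to 15 */
        memcpy(dst, &first, 8);         /* store bytes 0 to 7 */
        memcpy(dst + 8, &second, 8);    /* store bytes 8 to 15 */

        src += 16;
        dst += 16;
        size_in_bytes -= 16;
    }

    /* Tail: fewer than 16 bytes remain, so fall back to single bytes. */
    while (size_in_bytes > 0) {
        *dst++ = *src++;
        size_in_bytes--;
    }
}
```

A tail loop of single bytes is the simplest choice. Stepping down through 8-byte, 4-byte, and 2-byte accesses would need fewer iterations at the cost of more code.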
You should see that the instruction count has gone down. This screenshot shows the result using the reference solution:
Remember that this is the instruction count for the entire program, not just for the execution of my_memcpy(). Even so, we can see that the more complex implementation has reduced the number of instructions needed to copy the data (6,778 instead of 7,453), at least with this size of buffer. The larger the data set, the bigger the reduction. However, with very small amounts of data, the new implementation might be slower.
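To see why the benefit depends on the buffer size, it can help to count loop iterations for each approach. The sketch below counts iterations only, not executed instructions, and it assumes that the multi-byte version copies 16 bytes per main-loop iteration and then handles the remainder one byte at a time. Both function names are hypothetical.

```c
#include <stdint.h>

/* Iterations of a byte by byte loop: one per byte. */
uint32_t byte_loop_iterations(uint32_t size_in_bytes)
{
    return size_in_bytes;
}

/* Iterations of a 16-byte main loop plus a single-byte tail loop. */
uint32_t pair_loop_iterations(uint32_t size_in_bytes)
{
    uint32_t main_iterations = size_in_bytes / 16;  /* full 16-byte chunks */
    uint32_t tail_iterations = size_in_bytes % 16;  /* leftover bytes */

    return main_iterations + tail_iterations;
}
```

Under these assumptions, a 256-byte buffer needs 256 iterations byte by byte but only 16 with pair accesses, while a 15-byte buffer needs 15 iterations either way, and the multi-byte version still has to execute its extra size checks.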