Accessing memory

The previous exercise, Data processing and flow control, concentrated on data processing and flow control instructions. In this exercise, we show how to access memory with load and store instructions. To do this, we implement our own simple memory copy (memcpy) routine.

Get started

As with the Data processing and flow control exercise, a framework project is provided to get you started. Follow these steps:

  1. Import the 2_memcpy project into Arm Development Studio.
  2. The imported project then appears in the Project Explorer pane, as you can see in the following screenshot:

    You should see the following files in the memcpy project:

    • startup.s
      This is a simple reset handler. You will not need to modify this file for this exercise. Unlike the startup file in the Data processing and flow control exercise, this one includes code to configure and enable the MMU.
    • main.c
      This contains the C main() function, and implements a simple test harness for the function that you will develop.
    • memcpy.s
      This is an A64 assembler file. This file contains an empty function definition that you will complete.
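The contents of main.c are not reproduced here. As a rough sketch of what a test harness for the copy routine could look like, the following C code fills a source buffer with a pattern, calls the routine, and checks the result. The buffer size, the guard byte, and the name check_my_memcpy are assumptions made for illustration, and the C body of my_memcpy is only a stand-in for the assembly that memcpy.s provides in the real project.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BUF_SIZE 64u    /* hypothetical size; the real harness may differ */
#define GUARD    0xAAu  /* hypothetical poison value for overrun checks   */

/* Stand-in so that this sketch builds on its own. In the project, this
   symbol comes from memcpy.s. Note the argument order: src, then dst. */
void my_memcpy(uint8_t *src, uint8_t *dst, uint32_t size_in_bytes)
{
    memcpy(dst, src, size_in_bytes);
}

/* Copies a known pattern and asserts that it arrived intact. */
void check_my_memcpy(void)
{
    uint8_t src[BUF_SIZE];
    uint8_t dst[BUF_SIZE + 1u];          /* one extra guard byte */

    for (uint32_t i = 0; i < BUF_SIZE; i++) {
        src[i] = (uint8_t)(i * 7u + 3u); /* recognisable pattern */
    }
    memset(dst, GUARD, sizeof dst);      /* poison dst, including the guard */

    my_memcpy(src, dst, BUF_SIZE);

    assert(memcmp(src, dst, BUF_SIZE) == 0); /* every byte was copied       */
    assert(dst[BUF_SIZE] == GUARD);          /* nothing written past the end */
}
```

The guard byte is a cheap way to catch a loop that runs one iteration too many, which is an easy mistake to make when the tail of a buffer is handled separately.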

Implement byte by byte copying

  3. Open memcpy.s.
  4. The following code shows the empty function that we are going to implement:

      .global my_memcpy
      // void my_memcpy(uint8_t* src, uint8_t* dst, uint32_t size_in_bytes)
      .type my_memcpy, @function
      //  ADD YOUR CODE HERE

    The function takes three arguments:

    • src - a pointer to the source buffer, which points to the first data item
    • dst - a pointer to the destination buffer, which points to the first empty location
    • size_in_bytes - the number of bytes to be copied

    For this exercise, we can assume that the pointers are to memory that is marked as Normal and that strict alignment checking is not enabled. This means that unaligned accesses are permitted.

    There are several possible approaches to implementing the function. We start with the simplest approach, which is a byte by byte copy. In pseudocode, this looks like:

    	while size_in_bytes greater than 0
    		load byte from src
    		increment src pointer by 1
    		store byte to dst
    		increment dst pointer by 1
    		decrement size_in_bytes by 1
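As a reference point while writing the assembly, the pseudocode above can be modelled in C. This is only a sketch of the logic, not the reference solution: the names my_memcpy_bytes and my_memcpy_bytes_demo are made up here so that the model does not clash with the my_memcpy symbol in memcpy.s.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* C model of the byte by byte loop. Argument order follows the exercise:
   source first, destination second. */
void my_memcpy_bytes(uint8_t *src, uint8_t *dst, uint32_t size_in_bytes)
{
    while (size_in_bytes > 0) {
        uint8_t byte = *src;   /* load byte from src           */
        src++;                 /* increment src pointer by 1   */
        *dst = byte;           /* store byte to dst            */
        dst++;                 /* increment dst pointer by 1   */
        size_in_bytes--;       /* decrement size_in_bytes by 1 */
    }
}

/* Usage example: copy five bytes and check that they match. */
void my_memcpy_bytes_demo(void)
{
    uint8_t src[5] = { 10, 20, 30, 40, 50 };
    uint8_t dst[5] = { 0 };

    my_memcpy_bytes(src, dst, 5);
    assert(memcmp(src, dst, sizeof src) == 0);
}
```

Each line of the loop body corresponds to one line of the pseudocode. How many A64 instructions each line needs depends on the addressing mode you choose.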
  5. Implement the function, copying one byte at a time.
  6. Here are a few things to consider before getting started:

    • What size are addresses in AArch64?
    • How will you update the pointers after each iteration?
    • What is the syntax for loading a sub-register sized quantity?

Run the completed image

    Once you have completed the function, you can test it using the Fixed Virtual Platform (FVP) models that are provided with Arm Development Studio.

  7. Right-click on the project and select Build Project to build the project.
  8. As in the Data processing and flow control exercise, the Console tab shows the build messages. If the project builds successfully, the output looks like this screenshot:

  9. Check for any errors. If there are any, correct them and try to rebuild the project.
  10. When you have successfully built your image, test it using the FVP models. This exercise uses an FVP with a single-core Cortex-A53 processor.

  11. Launch the model using the A64 – memcpy.launch script in the project.
  12. Click the green arrow icon to run the model.
  13. The Target Console tab shows the output from the simulator. The output for a successful run looks like this:

    terminal_0: Listening for serial connection on port 5000
    terminal_1: Listening for serial connection on port 5001
    terminal_2: Listening for serial connection on port 5002
    terminal_3: Listening for serial connection on port 5003
    CADI server started listening to port 7000
    Info: FVP_Base_Cortex_A53x1: CADI Debug Server started for ARM Models...
    CADI server is reported on port 7000
    Memcpy Workbook: Finished successfully

    Every time you launch the model, you will see a window open, like the one that is shown here:

    This window represents the LCD and switches of the simulated platform. These exercises do not use these features, but there is something else of interest. Total Instr reports the number of instructions that the simulator has executed since it was launched.

  14. Make a note of the instruction count after running your implementation.
  15. Note: Your figure might be different from the one shown in the screenshot. The total instruction count of 7,453 is based on the reference solution with Arm Compiler 6.12.

Implement multi-byte copying

    Copying one byte at a time is simple, but inefficient. For most copy operations, we want to transfer more than one byte at a time, so that we can reduce the number of iterations. We might also try to issue multiple loads and stores for each iteration.

    The next step is to modify my_memcpy() to use load and store pair instructions with X registers. This means that 128 bits, not 8 bits, are copied per iteration. The code also needs to handle data whose size is not a multiple of 128 bits. Follow these steps:

  16. Modify the my_memcpy() function to use the LDP and STP instructions with X registers for the first iterations. Use smaller accesses for the last few bytes of the data.
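The structure that this step describes can be modelled in C as a main loop that moves 16 bytes per iteration, followed by a tail loop for whatever is left. This is a sketch under assumptions, not the reference solution: the names my_memcpy_pairs and my_memcpy_pairs_demo are invented for illustration, and the two uint64_t variables stand in for a pair of X registers. The fixed-size memcpy calls are the portable C way to express an unaligned 64-bit access.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* C model of a copy that moves 16 bytes per iteration, then finishes the
   remainder one byte at a time. */
void my_memcpy_pairs(uint8_t *src, uint8_t *dst, uint32_t size_in_bytes)
{
    /* Main loop: two 64-bit values per iteration (models LDP then STP). */
    while (size_in_bytes >= 16u) {
        uint64_t first, second;

        memcpy(&first, src, 8);        /* load pair: first register   */
        memcpy(&second, src + 8, 8);   /* load pair: second register  */
        memcpy(dst, &first, 8);        /* store pair: first register  */
        memcpy(dst + 8, &second, 8);   /* store pair: second register */

        src += 16;
        dst += 16;
        size_in_bytes -= 16u;
    }

    /* Tail: fewer than 16 bytes remain, so fall back to byte accesses. */
    while (size_in_bytes > 0) {
        *dst++ = *src++;
        size_in_bytes--;
    }
}

/* Usage example: 37 bytes is two full 16-byte iterations plus 5 tail bytes. */
void my_memcpy_pairs_demo(void)
{
    uint8_t src[37];
    uint8_t dst[38];                   /* one extra guard byte */

    for (uint32_t i = 0; i < 37u; i++) {
        src[i] = (uint8_t)(i + 1u);
    }
    memset(dst, 0xEE, sizeof dst);

    my_memcpy_pairs(src, dst, 37u);
    assert(memcmp(src, dst, 37) == 0);
    assert(dst[37] == 0xEE);           /* guard byte untouched */
}
```

A tail that copies one byte at a time is the simplest choice. Using 8-byte, 4-byte, and 2-byte accesses for the remainder would need fewer iterations at the cost of more code.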
  17. Re-run the test program and check the instruction count. Has it changed?
  18. You should see that the instruction count has gone down. This screenshot shows the result using the reference solution:

    Remember that this is the instruction count for the entire program, not just for the running of my_memcpy(). Even so, we can see that the more complex implementation has reduced the number of instructions needed to copy the data (6,778 instead of 7,453), at least with this size of buffer. The larger the data set, the bigger the reduction. However, with very small amounts of data, the new implementation might be slower.
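To see why the saving grows with the buffer size, it can help to count loop iterations rather than instructions. The sketch below assumes a 16-byte main loop followed by a byte by byte tail, and the function names are invented for illustration. It does not reproduce the instruction counts reported by the model, because those cover the whole program and depend on how many instructions each iteration contains.

```c
#include <assert.h>
#include <stdint.h>

/* Byte by byte copy: one iteration per byte. */
uint32_t byte_loop_iterations(uint32_t size_in_bytes)
{
    return size_in_bytes;
}

/* 16 bytes per main-loop iteration, then one iteration per leftover byte. */
uint32_t pair_loop_iterations(uint32_t size_in_bytes)
{
    return size_in_bytes / 16u + size_in_bytes % 16u;
}

/* Usage example: the gap widens as the buffer grows. */
void iteration_count_demo(void)
{
    assert(byte_loop_iterations(1024u) == 1024u);
    assert(pair_loop_iterations(1024u) == 64u);   /* 64 + 0 */
    assert(pair_loop_iterations(100u) == 10u);    /* 6 + 4  */
    assert(pair_loop_iterations(5u) == 5u);       /* 0 + 5: no saving */
}
```

For buffers smaller than 16 bytes the iteration count is unchanged, so any extra setup code in the more complex routine is pure overhead. This is consistent with the remark that very small copies might be slower.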

  19. Experiment with different sizes of data and different implementations of the copy routine. Consider using the wider floating-point registers.
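One way to reason about these experiments before writing assembly is a C model in which the number of bytes moved per iteration is a parameter. The sketch below is an assumption-laden illustration: my_memcpy_chunked, MAX_CHUNK, and the local regs buffer are made-up names, with regs standing in for whichever registers the loop would use. A chunk of 16 models a pair of X registers, and a chunk of 32 models a pair of 128-bit Q registers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_CHUNK 32u   /* hypothetical upper limit: two 128-bit registers */

/* Copies `chunk` bytes per main-loop iteration, then finishes byte by byte. */
void my_memcpy_chunked(uint8_t *src, uint8_t *dst,
                       uint32_t size_in_bytes, uint32_t chunk)
{
    uint8_t regs[MAX_CHUNK];           /* stand-in for the registers */

    assert(chunk >= 1u && chunk <= MAX_CHUNK);

    while (size_in_bytes >= chunk) {
        memcpy(regs, src, chunk);      /* models the load(s)  */
        memcpy(dst, regs, chunk);      /* models the store(s) */
        src += chunk;
        dst += chunk;
        size_in_bytes -= chunk;
    }

    while (size_in_bytes > 0) {        /* tail: leftover bytes */
        *dst++ = *src++;
        size_in_bytes--;
    }
}

/* Usage example: the same data copied with three different chunk sizes. */
void my_memcpy_chunked_demo(void)
{
    uint8_t src[70];
    uint8_t dst[70];
    uint32_t chunks[3] = { 1u, 16u, 32u };

    for (uint32_t i = 0; i < 70u; i++) {
        src[i] = (uint8_t)(i * 3u);
    }
    for (uint32_t c = 0; c < 3u; c++) {
        memset(dst, 0, sizeof dst);
        my_memcpy_chunked(src, dst, 70u, chunks[c]);
        assert(memcmp(src, dst, sizeof src) == 0);
    }
}
```

Larger chunks mean fewer main-loop iterations but a longer worst-case tail, so the best choice depends on the sizes you expect to copy.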