Appendix A - Example solutions

This section gives solutions to this set of exercises. There is more than one way of implementing the exercises, so your own solution might look different and still be correct.

Data processing and flow control solution

The GCD algorithm that is shown in the flow chart in Data processing and flow control can be directly implemented in A64, as you can see in this code:

    gcd:
    CMP    w0, w1         // Compare a and b
    B.EQ   end            // If they are equal, skip to the end
    B.LS   less_than      // If unsigned lower (a < b), branch to b = b - a
    SUB    w0, w0, w1     // a = a - b
    B      gcd            // Branch back to start
    less_than:
    SUB    w1, w1, w0     // b = b - a
    B      gcd            // Branch back to start
    end:
    RET                   // Return, with the result in w0

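The code uses the LS condition, which means unsigned lower or same. Because the equal case has already been handled by B.EQ, the B.LS branch is only taken when a is lower than b. Alternatively, we could have checked for the unsigned higher (HI) condition.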

There are also Signed Greater Than (GT) and Signed Less Than (LT) conditions. These conditions are not used, because the code treats the passed-in values as unsigned. However, with the single set of test values that is used in the test program, you would get the same result.
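To make the signed versus unsigned distinction concrete, here is a minimal C model of the branching loop. It is a sketch, not part of the exercise project: the function names gcd_unsigned, is_higher_unsigned, and is_greater_signed are hypothetical, and the assertion on non-zero inputs is an added assumption, because repeated subtraction never terminates if either value is 0.

```c
#include <assert.h>
#include <stdint.h>

/* C model of the branching A64 loop: repeated subtraction with an
 * unsigned comparison, matching the LS/HI condition codes. */
uint32_t gcd_unsigned(uint32_t a, uint32_t b)
{
    assert(a != 0 && b != 0);   /* assumption: the loop never ends if either is 0 */
    while (a != b) {            /* B.EQ end */
        if (a < b)              /* B.LS less_than (equality already excluded) */
            b -= a;             /* SUB w1, w1, w0 */
        else
            a -= b;             /* SUB w0, w0, w1 */
    }
    return a;
}

/* The same bit patterns can order differently under the two views:
 * HI/LS compare as unsigned, GT/LT compare as signed. The cast assumes
 * the usual two's complement conversion. */
int is_higher_unsigned(uint32_t a, uint32_t b) { return a > b; }
int is_greater_signed(uint32_t a, uint32_t b)  { return (int32_t)a > (int32_t)b; }
```

For small positive inputs both helpers agree, which is why the signed conditions would also pass with small test values. They disagree once bit 31 is set: 0x80000000 is higher than 1 as an unsigned value, but less than 1 as a signed value.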

Another way to implement the GCD algorithm is to use the conditional select instructions, as you can see in this code:

    gcd:
    SUBS   w2, w0, w1       // tmp = a - b, with ALU flag update
    CSEL   w0, w2, w0, HI   // IF "unsigned higher" THEN a = tmp ELSE a = a
    CSNEG  w1, w1, w2, HI   // IF "unsigned higher" THEN b = b ELSE b = neg(tmp)
    B.NE   gcd              // Branch back to start if tmp was not 0

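This solution can be more efficient, because it has a single branch per iteration instead of up to three. The conditional select instructions choose the correct new values for a and b on each iteration, and neither of them changes the flags that SUBS set, so B.NE still tests whether a and b were equal.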
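The select-based loop can be modeled in C as follows. This is a sketch under assumptions: the name gcd_select is hypothetical, the non-zero assertion is added for the same termination reason as in any subtraction-based GCD, and a C compiler is not guaranteed to emit CSEL or CSNEG for the conditional expressions.

```c
#include <assert.h>
#include <stdint.h>

/* C model of the conditional-select loop. Each iteration computes the
 * difference once, then selects new values for a and b from it. */
uint32_t gcd_select(uint32_t a, uint32_t b)
{
    uint32_t tmp;

    assert(a != 0 && b != 0);               /* assumption: inputs are non-zero */
    do {
        tmp = a - b;                        /* SUBS w2, w0, w1 (wraps modulo 2^32) */
        int hi = a > b;                     /* HI: unsigned higher */
        a = hi ? tmp : a;                   /* CSEL  w0, w2, w0, HI */
        b = hi ? b : (uint32_t)0 - tmp;     /* CSNEG w1, w1, w2, HI: b = b - a */
    } while (tmp != 0);                     /* B.NE gcd: Z flag comes from SUBS */
    return a;
}
```

On the final iteration a equals b, so tmp is 0 and b is overwritten with 0. That is harmless, because the result is returned in a.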

Access memory solution

For the byte-by-byte copy shown in Accessing memory, here is a simple implementation:

    my_memcpy:
    CBZ    w2, end             // Check for number of bytes being 0
    LDRB   w3, [x0], #1        // Load byte[n] from src, post-incrementing pointer
    STRB   w3, [x1], #1        // Store byte[n] to dst, post-incrementing pointer
    SUBS   w2, w2, #1          // Decrement number of bytes, updating ALU flags
    B.NE   my_memcpy           // Branch if number of bytes remaining not 0
    end:
    RET

As discussed in Accessing memory, copying one byte at a time is inefficient. Here is a possible solution for multi-byte copying:

    my_memcpy:
    // Loop until fewer than 16 bytes of data are left
    CMP    w2, #15
    B.LS   my_memcpy_last_15_bytes
    LDP    x3, x4, [x0], #16
    STP    x3, x4, [x1], #16
    SUB    w2, w2, #16
    B      my_memcpy

    my_memcpy_last_15_bytes:
    // Loop until fewer than 4 bytes of data are left
    CMP    w2, #3
    B.LS   my_memcpy_last_3_bytes
    LDR    w3, [x0], #4
    STR    w3, [x1], #4
    SUB    w2, w2, #4
    B      my_memcpy_last_15_bytes

    my_memcpy_last_3_bytes:
    // Copy the last remaining bytes (3 or fewer)
    CBZ    w2, my_memcpy_end
    LDRB   w3, [x0], #1
    STRB   w3, [x1], #1
    SUB    w2, w2, #1
    B      my_memcpy_last_3_bytes

    my_memcpy_end:
    RET


This implementation improves on the previous code by using larger registers, and by copying multiple registers for each iteration. We could extend the code further by using multiple LDP and STP instructions per iteration, and by using the wider Q registers for the operations. In part, how we optimize the function might depend on what we expect the typical data size to be.
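The three-tier structure of the multi-byte copy can be sketched in C. The name copy_tiered is hypothetical, and the parameter order (source first) is chosen only to mirror the x0, x1, w2 register usage in the assembly. The inner fixed-size memcpy() calls are the portable C way to express a possibly unaligned load or store; they stand in for LDP/STP and LDR/STR and are not what the assembly does literally.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* C model of the tiered copy: 16 bytes at a time, then 4, then 1. */
void copy_tiered(const void *src, void *dst, size_t n)
{
    const unsigned char *s = src;
    unsigned char *d = dst;

    while (n >= 16) {               /* CMP w2, #15 / B.LS exits at 15 or fewer */
        uint64_t pair[2];
        memcpy(pair, s, 16);        /* LDP x3, x4, [x0], #16 */
        memcpy(d, pair, 16);        /* STP x3, x4, [x1], #16 */
        s += 16; d += 16; n -= 16;
    }
    while (n >= 4) {                /* CMP w2, #3 / B.LS exits at 3 or fewer */
        uint32_t word;
        memcpy(&word, s, 4);        /* LDR w3, [x0], #4 */
        memcpy(d, &word, 4);        /* STR w3, [x1], #4 */
        s += 4; d += 4; n -= 4;
    }
    while (n != 0) {                /* CBZ w2, my_memcpy_end */
        *d++ = *s++;                /* LDRB / STRB with post-increment */
        n--;
    }
}
```

A length such as 39 exercises all three loops: two 16-byte iterations, one 4-byte iteration, and three single-byte iterations.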

If you experiment with the C library memcpy(), you will see that the Arm-provided library provides multiple implementations. The compiler will attempt to select the one that is most appropriate to the context. For this exercise, modifying main() to call memcpy() results in the following code:

0x0000000080001804:    LDP      q0,q1,[x8,#0]
0x0000000080001808:    ADRP     x9,{pc}+0x4000 ; 0x80005808
0x000000008000180C:    ADD      x9,x9,#0x110
0x0000000080001810:    LDUR     q2,[x8,#0x4c]
0x0000000080001814:    LDP      q4,q3,[x8,#0x30]
0x0000000080001818:    STP      q0,q1,[x9,#0]
0x000000008000181C:    LDR      q1,[x8,#0x20]
0x0000000080001820:    MOV      w10,#0xbeef
0x0000000080001824:    MOV      x1,xzr
0x0000000080001828:    MOVK     w10,#0xdead,LSL #16
0x000000008000182C:    STUR     q2,[x9,#0x4c]
0x0000000080001830:    STP      q4,q3,[x9,#0x30]
0x0000000080001834:    STR      w10,[x8,#0x5c]
0x0000000080001838:    STR      q1,[x9,#0x20]

Note: You can get the disassembly of an ELF image or object by double-clicking on the file within the Project Explorer tab and then selecting Disassembly.

In this instance, the compiler has optimized the output by fully inlining the code that is needed to perform the copy operation. The compiler could do this because the size of the copied data, the source, and the destination were all known at compile time.
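As an illustration of the kind of call that gives the compiler this information, here is a minimal sketch. The names BUF_BYTES, src_buf, dst_buf, and copy_fixed, and the buffer size, are hypothetical and are not taken from the exercise project. Whether a given compiler actually inlines the copy depends on its version and optimization settings.

```c
#include <stdint.h>
#include <string.h>

#define BUF_BYTES 92u   /* hypothetical size, fixed at compile time */

static const uint8_t src_buf[BUF_BYTES] = { 1, 2, 3, 4 };
static uint8_t dst_buf[BUF_BYTES];

/* The size, source, and destination are all compile-time constants here,
 * so an optimizing compiler is free to expand the copy into a short
 * sequence of loads and stores instead of calling the library routine. */
const uint8_t *copy_fixed(void)
{
    memcpy(dst_buf, src_buf, sizeof src_buf);
    return dst_buf;
}
```

If the length were instead passed in as a function argument, the compiler would have less to work with and would be more likely to emit a call to one of the library implementations.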

This output was generated using Arm Compiler 6.12. The exact output for different compiler versions might vary.

System control solution

The System control exercise recreates the startup code that is used in Data processing and flow control. Here is some code from the startup.s file that is provided with the GCD project:

  .type start64, @function
  start64:

  // Check which core is running
  // ----------------------------
  // Core 0.0.0.0 should continue to execute
  // All other cores should be put into sleep (WFI)
  MRS      x0, MPIDR_EL1
  UBFX     x1, x0, #32, #8     // Extract Aff3
  BFI      w0, w1, #24, #8     // Insert Aff3 into bits [31:24], so that [31:0]
                               // is now Aff3.Aff2.Aff1.Aff0
                               // Using w register means bits [63:32] are zeroed
  CBZ      w0, primary_core    // If 0.0.0.0, branch to code for primary core
  1:
  WFI                          // If not 0.0.0.0, then go to sleep
  B        1b                  // If woken, go back to sleep

Aff2, Aff1, and Aff0 are stored consecutively in bits [23:0] of MPIDR_EL1. However, Aff3 is stored in bits [39:32], with other fields in bits [31:24]. This register layout makes comparison more difficult. The code extracts Aff3 from the upper half of the register, then inserts it into bits [31:24] to make all the affinity fields consecutive.
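The bit manipulation can be checked on a host machine with a small C model. The function name pack_affinity is hypothetical, and the sample register values are made up for illustration; real values come from reading MPIDR_EL1 on the target.

```c
#include <stdint.h>

/* C model of the UBFX/BFI pair: returns Aff3.Aff2.Aff1.Aff0 packed into
 * 32 bits, discarding the non-affinity fields in bits [31:24]. */
uint32_t pack_affinity(uint64_t mpidr)
{
    uint32_t aff3 = (uint32_t)((mpidr >> 32) & 0xFFu);  /* UBFX x1, x0, #32, #8 */
    uint32_t low  = (uint32_t)mpidr;                    /* w0 view of x0 */

    /* BFI w0, w1, #24, #8: keep bits [23:0], replace bits [31:24] with Aff3 */
    return (low & 0x00FFFFFFu) | (aff3 << 24);
}
```

For example, a register value of 0x0000000080000000 has a non-zero bit in [31:24] but all affinity fields are 0, so the packed result is 0 and the CBZ comparison treats that core as 0.0.0.0. A value of 0x0000000180000203 packs to 0x01000203, that is, affinity 1.0.2.3.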

The core with affinity 0.0.0.0 will branch to the primary_core label and continue with the rest of the startup code. The other cores will execute the WFI instruction and go into Standby mode. If those cores are inadvertently woken from standby, a simple loop returns them to the WFI instruction.

The code to initialize the floating-point traps is shown here:

  // Disable trapping of CPTR_EL2 accesses or use of Adv.SIMD/FPU
  // -------------------------------------------------------------
  MSR      CPTR_EL3, xzr      // Write 0, clearing all trap bits

  // The effects of changes to the system registers are
  // only guaranteed to be visible after a context
  // synchronization event.  See the Barriers guide

We want to clear all the trap bits to 0. The simplest way to do this is to write the zero-register, XZR, into the system register.

For installing the vector table, use this code:

  // Install EL3 vector table
  // -------------------------
  LDR      x0, =vector_table
  MSR      VBAR_EL3, x0

The solution uses the LDR instruction, but ADR would also have worked.
