Helium assembly code

Most programmers should not need to write their own Helium assembly code. Relying on an auto-vectorizing compiler, using Helium-enabled libraries, or using Helium intrinsics should all be considered in preference to hand-crafting assembly code.

However, in a small number of cases, hand-written Helium assembly code can be a reasonable alternative for experienced programmers.

Introduction to mixed C and assembly code programming

To code a function in Helium assembly language, we use two source files:

  • A .s file containing the assembly code function
  • A .c file containing the C code that calls the assembly code function

We then call the assembly function from our C code.

So that the C code can call the assembly function, the following requirements must be met:

  • The calling C code must:
    • Declare the external assembly function using the extern keyword
  • The called assembly function must:
    • Declare itself as a global function, using the .globl and .type directives
    • Conform to the Procedure Call Standard for the Arm Architecture (AAPCS), for example using registers R0 through R3 for input parameters, and R0 for any return value

For example, the following assembly code in myadd.s implements a function myadd():

     .globl   myadd
     .p2align 2
     .type    myadd,%function

myadd:                     // Function "myadd" entry point.
     add      r0, r0, r1   // Arguments in R0 and R1. Add and put the result in R0.
     bx       lr           // Return by branching to the address in the link register.

The following C code in myadd.c then calls the assembly function myadd():

#include <stdio.h>

extern int myadd(int a, int b);

int main(void)
{
        int a = 4;
        int b = 5;
        printf("Adding %d and %d results in %d\n", a, b, myadd(a, b));
        return 0;
}

This example code can be compiled with Arm Compiler 6 as follows:

armclang --target=arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp myadd.c myadd.s

Mixed C and assembly code Helium example

The following example uses Helium assembly code to implement a function my_maximum_u32(). This function computes the maximum value of an array of data.

First we will look at the C code that calls the function, then we will look at the assembly code that implements the function.

The C code in max.c does the following:

  • Declares the my_maximum_u32() function as external using the extern keyword.
    The specification of the function shows that it takes the following arguments:
    • pSrc, a pointer to the input data array
    • blkCnt, the number of data elements in the array
    • pResult, a pointer to the integer result which will hold the maximum value on return
  • Creates an array of integer data, and a pointer to that array.
  • Creates an integer variable for the result, and a pointer to that variable.
  • Calculates the number of elements in the array by dividing the size of the array by the size of one element.
  • Calls the my_maximum_u32() function.
  • Prints the result.

The example code is as follows:

#include <stdio.h>

extern int my_maximum_u32(unsigned int *pSrc,
                          unsigned int blkCnt,
                          unsigned int *pResult);

int main(void)
{
  // Set up data in the input array.
  unsigned int A_src[] = {1, 1, 3, 4, 2, 1, 1, 3, 8, 6, 2, 6, 6, 1, 3, 2, 9, 1};
  unsigned int *pA_src = A_src;

  // Set up the result variable.
  unsigned int result;
  unsigned int *pResult = &result;

  // Get the number of elements in the array.
  unsigned int numElements = sizeof(A_src) / sizeof(A_src[0]);

  // Call the maximum function.
  my_maximum_u32(pA_src, numElements, pResult);

  // Print the result. The value is unsigned, so use %u.
  printf("Maximum = %u\n", result);

  return 0;
}


The assembly code in max.s does the following:

  • Defines the label my_maximum_u32 as a global function using the .globl and .type directives.
  • On entry to the function, pushes the link register (LR) to the stack to preserve it for later.
    We need to preserve the link register because the low-overhead loop that follows uses LR as its loop counter.
    The instruction also pushes the register R7 to the stack. We do not need to preserve R7, but the easiest way to maintain the required 8-byte stack alignment is to push and pop registers in pairs.
  • vmov initializes the Helium Q0 vector register to 4 x 32-bit lanes, and sets all 4 lanes to zero.
  • wlstp initializes a loop, setting LR to the total number of data elements as specified by the function argument in register R1.
  • vldrw loads the four lanes of vector register Q1 with the next four values pointed to by the data pointer address in R0. We use the post-increment variant of the instruction to increment the data pointer by 16 (4 x 32 bits = 16 bytes), so that on the next loop iteration we will read the next four data values.
  • vmax compares each of the four lanes in Q1 to the corresponding lane in Q0, and stores the maximum for each lane in Q0.
    On the first loop, because we initialized all lanes of Q0 to zero, any nonzero values in Q1 become the new maximum values.
    On subsequent loops, lanes in Q0 are only updated when new maximum values are discovered.
  • letp marks the end of the loop that is initialized by wlstp, exiting the loop when LR equals zero.
    On each iteration, letp decrements the loop counter in LR by the number of elements in the vector.
    For all iterations except the last, four elements are processed, so that the LR is decremented by four.
    On the final iteration, there may be fewer than four elements to process. In this case, tail predication is used to deal with the data residual when the number of elements to be processed is not an exact multiple of the number of elements in the vector.
  • After all data elements have been processed, vmaxv finds the maximum value across all lanes of vector register Q0, storing that maximum value in R0.
  • Finally, the maximum value is stored at the address that is specified by the parameter in R2. The function returns by popping the previously preserved LR value into the program counter (PC).
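
The steps above can be sketched as a scalar C model, which may help when checking the assembly code against a known-good result on a host machine. This is only a sketch of the loop's behavior, not the Helium implementation. The function name model_maximum_u32, the LANES constant, and the choice to return the maximum are illustrative assumptions. The model also advances the pointer by the number of active lanes so that it stays within the array, whereas vldrw always post-increments by 16 bytes.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4u  /* Assumption: a 128-bit vector holds four 32-bit lanes. */

/* Hypothetical scalar model of my_maximum_u32(); names are illustrative. */
unsigned int model_maximum_u32(const unsigned int *pSrc,
                               unsigned int blkCnt,
                               unsigned int *pResult)
{
    assert(pResult != NULL);

    unsigned int q0[LANES] = {0, 0, 0, 0};  /* vmov.i32 q0, #0 */
    unsigned int lr = blkCnt;               /* wlstp.32 loads LR from R1 */

    /* wlstp skips the loop entirely when the element count is zero. */
    while (lr > 0) {
        /* Tail predication: the final pass may have fewer than four
           active lanes. Inactive lanes of q0 keep their old values. */
        unsigned int active = (lr < LANES) ? lr : LANES;

        for (unsigned int lane = 0; lane < active; lane++) {
            unsigned int q1 = pSrc[lane];   /* vldrw.u32 q1, [r0], #16 */
            if (q1 > q0[lane]) {            /* vmax.u32 q0, q1, q0 */
                q0[lane] = q1;
            }
        }

        pSrc += active;  /* Model only: hardware always adds 16 bytes. */
        lr -= active;    /* letp decrements LR and loops while nonzero. */
    }

    unsigned int r0 = 0;                    /* movs r0, #0 */
    for (unsigned int lane = 0; lane < LANES; lane++) {
        if (q0[lane] > r0) {                /* vmaxv.u32 r0, q0 */
            r0 = q0[lane];
        }
    }

    *pResult = r0;                          /* str r0, [r2] */
    return r0;
}
```

For the 18-element array in max.c, the loop counter in this model takes the values 18, 14, 10, 6, and 2, so the final pass has two active lanes, and the result is 9.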

The example assembly code is as follows:

     .globl   my_maximum_u32
     .p2align 2
     .type    my_maximum_u32,%function

my_maximum_u32:                    // Function "my_maximum_u32" entry point.
     push       {r7, lr}           // Save LR (R7 keeps the stack 8-byte aligned)
     vmov.i32   q0, #0x0           // Initialize Q0
     wlstp.32   lr, r1, end        // Loop Start... (skips to "end" if the count is zero)
loop:
     vldrw.u32  q1, [r0], #16      //    Load 4 values into vector and increment pointer
     vmax.u32   q0, q1, q0         //    Find maximum
     letp       lr, loop           // ...Loop End
end:
     movs       r0, #0             // Initialize return value to zero
     vmaxv.u32  r0, q0             // Get maximum across all lanes in result vector
     str        r0, [r2]           // Store at address of result pointer
     pop        {r7, pc}           // Return by setting PC to saved LR

This example code can be compiled with Arm Compiler 6 as follows:

armclang --target=arm-arm-none-eabi -march=armv8.1-m.main+mve.fp+fp.dp max.c max.s