Vector instruction example

In this section of the guide, we introduce how Helium uses vector instructions when performing parallel arithmetic on vectors using vector registers. We also show how we can use intrinsics to perform parallel arithmetic.

Overview of Vector Multiply Accumulate

The example that we have chosen to show parallel arithmetic is Vector Multiply Accumulate (VMLA). At a high level, VMLA multiplies each element in the source vector by a scalar value. Each product is added to the respective element of the destination vector. The results are stored in the destination register.

Let’s look at the following VMLA instruction:

VMLA.S32 VectorOne, VectorTwo, Scalarvalue

This instruction shows that VectorOne is the accumulator vector register and is the destination register for the entire operation.

VMLA has three inputs and one output. The three inputs are:

  • VectorOne
    • The accumulator vector register
  • VectorTwo
    • The source vector register
  • Scalarvalue
    • The scalar general-purpose register

In the example, the vector registers, VectorOne and VectorTwo, are divided into four lanes of 32-bit values. However, the vector registers could be divided into two 64-bit values, or eight 16-bit values or sixteen 8-bit values, depending on which data types you are operating on. Both VectorOne and VectorTwo must contain the same number of data lanes.
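The lane layouts described above can be sketched in portable C. This is only an illustrative model, assuming a 128-bit vector register. The names vector128_model and lanes_for_element_bits are hypothetical and are not part of the Helium intrinsics.

```c
#include <stdint.h>

/* Hypothetical model of one 128-bit vector register.
 * Each member views the same 16 bytes with a different lane width. */
typedef union {
    int8_t  s8[16];   /* sixteen 8-bit lanes */
    int16_t s16[8];   /* eight 16-bit lanes  */
    int32_t s32[4];   /* four 32-bit lanes   */
    int64_t s64[2];   /* two 64-bit lanes    */
} vector128_model;

/* Number of lanes in a 128-bit register for a given element width.
 * Assumes element_bits is 8, 16, 32, or 64. */
unsigned lanes_for_element_bits(unsigned element_bits)
{
    return 128u / element_bits;
}
```

For the 32-bit data used in this example, lanes_for_element_bits(32) gives the four lanes that the text describes.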

In our example, the inputs:

  • VectorOne contains the numbers 7, 1, 6, 2.
  • VectorTwo contains the numbers 5, 2, 3, 6.
  • Scalarvalue contains 2.

When the three inputs have been declared:

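  • Scalarvalue is multiplied with each element of VectorTwo.
  • Each product is added to the corresponding element of VectorOne.
  • The sums are written back to VectorOne, which holds the output.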

That is, VectorOne[i] = VectorOne[i] + VectorTwo[i] * Scalarvalue where i={0..elts-1}
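The formula above can be checked with a scalar reference model in plain C. This is a sketch for illustration only. The function name vmla_reference is hypothetical, and the loop processes one element at a time, whereas the VMLA instruction processes all lanes in parallel.

```c
#include <stdint.h>
#include <stddef.h>

/* Scalar reference model of VMLA (hypothetical helper, not an intrinsic):
 * acc[i] = acc[i] + src[i] * scalar, for each of the `lanes` elements. */
void vmla_reference(int32_t *acc, const int32_t *src,
                    int32_t scalar, size_t lanes)
{
    for (size_t i = 0; i < lanes; i++) {
        acc[i] += src[i] * scalar;
    }
}
```

With the example inputs, acc = {7, 1, 6, 2}, src = {5, 2, 3, 6}, and scalar = 2, the accumulator becomes {17, 5, 12, 14}.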

These steps are shown in the following diagram:


Implementation using intrinsics

One way that you could use VMLA in your C code is by using intrinsics. Intrinsics are functions which the compiler understands and replaces with low-level Helium instructions.

First, we declare two arrays to hold the input vectors and an integer variable to hold the scalar value. You can see this in the following code:    

//Declaring the arrays and the scalar value
const int32_t arrayone[] = {5, 2, 3, 6};
const int32_t arraytwo[] = {7, 1, 6, 2};

int32_t Scalarvalue = 2;

In this example, we only use four numbers in our vectors. However, a real-world example might use more than four numbers. No matter how many numbers are in the arrays, pointers are needed because the load intrinsics take a pointer to the data that they read. This can be seen in the following code:

//Declaring the pointer values
const int32_t *pone = arrayone;
const int32_t *ptwo = arraytwo;

When the arrays and the pointers have been declared, the values can be loaded into vector registers.

The following code uses intrinsics to load the array values into Helium vector registers:

//Loading the 4 values from the array
int32x4_t VectorOne = vld1q_s32 (pone);
int32x4_t VectorTwo = vld1q_s32 (ptwo);

The following code performs the multiply-accumulate operation:

int32x4_t Result = vmlaq_n_s32 (VectorTwo, VectorOne, Scalarvalue); 

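This computes Result = VectorTwo + (VectorOne x Scalarvalue). The first argument of vmlaq_n_s32 is the accumulator, the second argument is the vector that is multiplied, and the third argument is the scalar. Note that in this code VectorTwo holds the accumulator values 7, 1, 6, 2 and VectorOne holds the source values 5, 2, 3, 6.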

Here is a complete working example:

#include <stdio.h>
#include <stdint.h>
#include "arm_mve.h"

int main(void) {
    printf("Program started\n");

    //Declaring arrays
    const int32_t arrayone[] = {5, 2, 3, 6};
    const int32_t arraytwo[] = {7, 1, 6, 2};

    //Value that the elements of arrayone are multiplied by
    int32_t scalarvalue = 2;

    //Pointers to the first element of each array, as expected by the load intrinsics
    const int32_t *pone = arrayone;
    const int32_t *ptwo = arraytwo;

    //Loading the 4 values from each array
    int32x4_t VectorOne = vld1q_s32 (pone);
    int32x4_t VectorTwo = vld1q_s32 (ptwo);

    //The VMLA instruction: Result = VectorTwo + (VectorOne x scalarvalue)
    int32x4_t Result = vmlaq_n_s32 (VectorTwo, VectorOne, scalarvalue);

    //Printing the results
    printf("Element 0: %d\n", (int) vgetq_lane_s32 (Result,0));
    printf("Element 1: %d\n", (int) vgetq_lane_s32 (Result,1));
    printf("Element 2: %d\n", (int) vgetq_lane_s32 (Result,2));
    printf("Element 3: %d\n", (int) vgetq_lane_s32 (Result,3));

    return 0;
}

How this can be done using C code

In the previous sections of this guide, we introduced the multiply-accumulate instruction and how it can be generated from intrinsics. Now let’s look at how the multiply-accumulate instruction can be generated from C source code. We use the following motivating example:

void vmla (int *__restrict Qda, int *__restrict Qn, int Rm, int N) {
    for (int i = 0; i < N; i++)
        Qda[i] += Qn[i] * Rm;
}

The preceding code shows that the C implementation is an almost direct translation of the pseudo-code that is shown in the previous section, Qda = Qda + (Qn x Rm). In the preceding example, the first two arguments of the function vmla are integer pointers modeling the two input streams. The first one is also the output stream. The arguments are annotated with the __restrict keyword to indicate that streams Qda and Qn do not overlap. For more details on __restrict, see Arm Compiler toolchain Compiler Reference. The third argument is the scalar value Rm. The fourth argument is N, which determines how many numbers are processed. In the previous section and example this was 4; here it is a run-time value N.

This example also shows that writing C code has advantages compared to writing intrinsics. C code is more compact, readable, and portable. However, writing C code relies on the compiler to efficiently translate your code into machine instructions. When more fine-grained control of the generated instructions is required, intrinsics might be a better solution.

When this vmla function is compiled with Arm Compiler 6.14, the following assembly code is generated:

    dlstp.32 lr, r3
.LBB0_1:
    vldrw.u32 q0, [r1], #16
    vldrw.u32 q1, [r12], #16
    vmla.u32 q1, q0, r2
    vstrw.32 q1, [r0]
    mov r0, r12
    letp lr, .LBB0_1

Let’s look first at the generated loop body that follows label .LBB0_1. There are two load instructions, each loading 16 bytes into one of the vector registers q0 and q1. Vector q0, which is loaded from the address in r1 (function argument Qn), is multiplied by the scalar value in r2, which contains function argument Rm, and the products are accumulated in vector register q1. The results are then stored to the address in r0, which corresponds to function argument Qda. The first and last instructions in this example control the execution of the loop, and are discussed in Tail-predication.

The VMLA instruction operates on vector registers, yet the C code uses only scalar values and operations: Qda[i] += Qn[i] * Rm. This shows that the compiler generates vector code from scalar C code. This transformation is called auto-vectorization, and it lets the compiler transform scalar code to vector code in an efficient way. Auto-vectorization is enabled with optimization level -Os and above.

Auto-vectorization helps to transform existing code and serves as an alternative to writing (vector) intrinsics. For more information, see Arm Compiler User Guide: Selecting optimization options.

In Tail-predication, we discuss the last interesting aspect of the generated code example: the loop control instructions.


Tail-predication

In Predication, we mention tail-predication as one of the predication forms introduced in Armv8.1-M. The assembly code example in How this can be done using C code uses two of the new instructions:

  • DLSTP: Do-Loop-Start, Tail-Predicated
  • LETP: Loop-End, Tail-Predicated

These two instructions are the tail-predicated versions of the Do-Loop-Start (DLS) and Loop-End (LE) instructions, and are part of the low-overhead-branches extension that aims to speed up loop execution. Tail-predication is best illustrated with an example:

for (int i=0; i < 10; i++)
  A[i] += B[i] * C[i];

In this example, the loop iterates ten times, which means that it processes ten integer elements. If we vectorize this code and pack four integer elements into one vector, two iterations of the vector loop process eight elements. Because we need to process ten elements, we need a scalar loop (a tail loop) that processes the remaining two elements. In pseudo-code, the vectorized code looks like this:

for (int i=0; i < 8; i+=4)       // the vector loop
  A[i:4] += B[i:4] * C[i:4];
for (int i=8; i < 10; i++)       // the tail-loop
  A[i] += B[i] * C[i];

The vector loop increments by four. It processes four elements at the same time, which is indicated with the i:4 array index notation. The tail loop is the original loop, except that it starts at 8. This means that it executes only the last two iterations, i=8 and i=9, which are the ninth and tenth iterations respectively.
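The split into a vector loop and a tail loop can be written as ordinary C for experimentation. This is a sketch under stated assumptions: the function name mul_acc_vector_plus_tail and the LANES constant are hypothetical, and the inner lane loop only models what a single four-lane vector operation does in parallel.

```c
#include <stdint.h>

#define LANES 4  /* assumed: four 32-bit lanes per vector */

/* Computes A[i] += B[i] * C[i] for n elements, split into a
 * vector loop and a scalar tail loop. */
void mul_acc_vector_plus_tail(int32_t *A, const int32_t *B,
                              const int32_t *C, int n)
{
    int vector_end = n - (n % LANES);   /* 8 when n is 10 */
    int i;

    /* Vector loop: each outer iteration models one four-lane operation. */
    for (i = 0; i < vector_end; i += LANES) {
        for (int lane = 0; lane < LANES; lane++) {
            A[i + lane] += B[i + lane] * C[i + lane];
        }
    }

    /* Tail loop: the remaining n % LANES elements, one at a time. */
    for (; i < n; i++) {
        A[i] += B[i] * C[i];
    }
}
```

For n = 10, the vector loop runs twice and the tail loop handles elements 8 and 9.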

Having both the vector loop and the tail loop comes at a cost: the overhead of executing two loops, and reduced code density. Tail-predication solves these problems. It allows a loop whose number of elements is not an exact multiple of the number of elements that fit in a vector to be executed as one single vector loop. In pseudo-code, that looks like this:

    for (int i=0; i < 12; i+=4)       // the vector loop and the tail-loop
        A[i:4] += B[i:4] * C[i:4],       active lane if i<10

The loop bound has been adjusted to 12, and the step size is 4, so that this loop executes 3 iterations. Executing 3 iterations of this vector loop would process 12 elements, while we need to process only 10. For the last iteration, tail-predication means that the last 2 lanes are disabled to make sure we only process these 10 elements and not 12. In the previous pseudo code example, this is indicated with the active lane if i<10 annotation. The Armv8.1-M tail-predication loop instruction solves this in hardware.
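The effect of the lane predicate can be modeled in portable C. This is an illustrative sketch only: the hardware disables lanes without any per-lane branch, while this model uses an if statement to show which lanes are active. The function name mul_acc_tail_predicated and the LANES constant are hypothetical.

```c
#include <stdint.h>

#define LANES 4  /* assumed: four 32-bit lanes per vector */

/* Computes A[i] += B[i] * C[i] for n elements in one single loop.
 * Lanes whose element index is n or greater are inactive. */
void mul_acc_tail_predicated(int32_t *A, const int32_t *B,
                             const int32_t *C, int n)
{
    /* Loop bound rounded up to a multiple of LANES: 12 when n is 10. */
    int bound = ((n + LANES - 1) / LANES) * LANES;

    for (int i = 0; i < bound; i += LANES) {
        for (int lane = 0; lane < LANES; lane++) {
            int active = (i + lane) < n;   /* the lane predicate */
            if (active) {
                A[i + lane] += B[i + lane] * C[i + lane];
            }
        }
    }
}
```

For n = 10, the loop runs three times, and in the third iteration only the first two lanes are active, so elements beyond index 9 are neither read nor written.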

Let’s now look again at the assembly output in How this can be done using C code and at the tail-predicated loop. The loop is set up with the following instruction:

dlstp.32 lr, r3

The instruction dlstp sets up a tail-predicated loop, where register lr contains the number of elements to be processed, with its initial value coming from register r3. This makes sense if we remember what the vmla function prototype looks like:

    void vmla (int *__restrict Qda, int *__restrict Qn, int Rm, int N)

Function argument N, which corresponds to the number of elements that are processed by the loop, is the fourth function argument and is passed in register r3. After this, the loop body is executed, and we branch back to the beginning of the loop with the new instruction, which looks like this:

letp lr, .LBB0_1

This Loop-End, Tail-Predicated (LETP) instruction branches back to label .LBB0_1 and decrements the number of elements to be processed in register lr. It also ensures that, for the last iteration, the right vector lanes are enabled or disabled; that is, it takes care of the tail-predication. Using Arm Compiler 6, tail-predicated loops can be generated from source code or intrinsics.
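The element counter that DLSTP sets up and LETP decrements can be modeled in plain C. This is a simplified sketch and not a description of the exact hardware behavior. The function name vmla_dlstp_letp_model is hypothetical, and the variable lr only stands in for the link register that holds the remaining element count.

```c
#include <stdint.h>

#define LANES 4  /* assumed: four 32-bit lanes, as selected by dlstp.32 */

/* Hypothetical scalar model of the tail-predicated vmla loop.
 * Returns the number of loop iterations that were executed. */
int vmla_dlstp_letp_model(int32_t *Qda, const int32_t *Qn,
                          int32_t Rm, int N)
{
    int lr = N;            /* models dlstp.32 lr, r3: elements to process */
    int base = 0;          /* index of the first lane of this iteration   */
    int iterations = 0;

    while (lr > 0) {
        /* In the last iteration, fewer than LANES lanes may be active. */
        int active = (lr < LANES) ? lr : LANES;

        for (int lane = 0; lane < active; lane++) {
            Qda[base + lane] += Qn[base + lane] * Rm;
        }

        base += LANES;     /* models the #16 post-increment of the loads  */
        lr -= LANES;       /* models letp: fewer elements left to process */
        iterations++;
    }
    return iterations;
}
```

For N = 10, the model runs three iterations with 4, 4, and 2 active lanes.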
