Calculating dot products using Neon Intrinsics

In this section, we look at calculating dot products using Neon intrinsics. To modify the dotProduct function to benefit from Neon intrinsics, we must split the for loop so that it uses data lanes. This means that we partition, or vectorize, the loop so that each instruction operates on a sequence of data elements at once. These sequences are defined as vectors. However, to distinguish them from the vectors that we use as inputs for the dot product, we call these sequences register vectors.

With register vectors, reduce the loop iterations so that, at every iteration, you multiply, then accumulate, multiple vector elements to calculate the dot product. The number of elements that you can work with depends on the register layout.
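To make the partitioning concrete before any intrinsics appear, here is a minimal scalar sketch in plain C that models four lanes with an ordinary array. The name dotProductLanes, the LANES constant, and the final leftover-element loop are illustrative assumptions, not part of the original dotProduct function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar model of a 4-lane loop; no Neon instructions are used. */
int dotProductLanes(const short *vector1, const short *vector2, size_t len) {
    enum { LANES = 4 };               /* assumed lane count */
    int partialSums[LANES] = {0};     /* one running sum per lane */
    size_t segments = len / LANES;    /* number of complete 4-element groups */

    assert(len == 0 || (vector1 != NULL && vector2 != NULL));

    /* The outer loop runs once per segment instead of once per element. */
    for (size_t i = 0; i < segments; i++) {
        /* A register vector would process these four lanes together. */
        for (size_t lane = 0; lane < LANES; lane++) {
            size_t k = i * LANES + lane;
            partialSums[lane] += vector1[k] * vector2[k];
        }
    }

    /* Combine the per-lane sums into a single value. */
    int result = 0;
    for (size_t lane = 0; lane < LANES; lane++) {
        result += partialSums[lane];
    }

    /* Elements that do not fill a complete segment are handled one by one. */
    for (size_t k = segments * LANES; k < len; k++) {
        result += vector1[k] * vector2[k];
    }
    return result;
}
```

With len equal to 9, the outer loop runs twice rather than eight times, and the ninth element is picked up by the final scalar loop.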

Neon registers can be used as 64-bit or 128-bit register vectors. In the 64-bit case, you can work with eight 8-bit, four 16-bit, or two 32-bit elements. In the 128-bit case, you can work with sixteen 8-bit, eight 16-bit, four 32-bit, or two 64-bit elements.

To represent various register vectors, Neon intrinsics use the following naming convention:

<type><size>x<number of lanes>_t

In this convention:

  • <type> is the data type (int, uint, float, or poly).
  • <size> is the number of bits used for the data type (8, 16, 32, 64).
  • <number of lanes> is the number of lanes in the register vector.

For example, int16x4_t represents a vector register with 4 lanes of 16-bit integer elements, which is equivalent to a four-element int16 one-dimensional array (short[4]).

Do not instantiate Neon intrinsic types directly. Instead, use dedicated methods to load data from arrays into CPU registers. The names of these methods start with vld. Method naming uses a convention that is similar to the one for type naming. All methods start with v, followed by a short method name (like ld for load) and a combination of a letter and a number of bits (for example, s16) that specifies the input data type.
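As a hedged illustration of the naming convention, the sketch below loads and stores 16-bit lanes with vld1_s16 and vst1_s16 (64-bit, four lanes) and with vld1q_s16 and vst1q_s16 (128-bit, eight lanes, where the q marks the 128-bit variant). The helper names copyFour and copyEight are hypothetical, and the memcpy fallback is an assumption added only so that the sketch also compiles on targets without Neon.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: copy four s16 values through a 64-bit register vector. */
void copyFour(const int16_t *src, int16_t *dst) {
    assert(src != NULL && dst != NULL);
#if defined(__ARM_NEON)
    int16x4_t lanes = vld1_s16(src);   /* v + ld1 + s16: load 4 x int16 */
    vst1_s16(dst, lanes);              /* v + st1 + s16: store 4 x int16 */
#else
    memcpy(dst, src, 4 * sizeof(int16_t));   /* fallback, assumption */
#endif
}

/* Hypothetical helper: copy eight s16 values through a 128-bit register vector. */
void copyEight(const int16_t *src, int16_t *dst) {
    assert(src != NULL && dst != NULL);
#if defined(__ARM_NEON)
    int16x8_t lanes = vld1q_s16(src);  /* q suffix: 128-bit, 8 x int16 */
    vst1q_s16(dst, lanes);
#else
    memcpy(dst, src, 8 * sizeof(int16_t));   /* fallback, assumption */
#endif
}
```

The register vector is only ever produced by a load intrinsic and consumed by a store intrinsic; it is never built field by field.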

Neon intrinsics correspond directly to assembly instructions. The following code uses them to calculate the dot product:

#include <arm_neon.h>

// Note: only len / 4 complete segments are processed, so len is
// expected to be a multiple of 4.
int dotProductNeon(short* vector1, short* vector2, short len) {
	const short transferSize = 4;
	short segments = len / transferSize;

	// 4-element vector of zeros
	int32x4_t partialSumsNeon = vdupq_n_s32(0);

	// Main loop (note that loop index goes through segments)
	for(short i = 0; i < segments; i++) {
		// Load vector elements to registers
		short offset = i * transferSize;
		int16x4_t vector1Neon = vld1_s16(vector1 + offset);
		int16x4_t vector2Neon = vld1_s16(vector2 + offset);

		// Multiply and accumulate: partialSumsNeon += vector1Neon * vector2Neon
		partialSumsNeon = vmlal_s16(partialSumsNeon, vector1Neon, vector2Neon);
	}

	// Store partial sums
	int partialSums[transferSize];
	vst1q_s32(partialSums, partialSumsNeon);

	// Sum up partial sums
	int result = 0;
	for(short i = 0; i < transferSize; i++) {
		result += partialSums[i];
	}

	return result;
}
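The loop above processes only len / 4 complete segments, so when len is not a multiple of 4 the last one to three elements are left out of the result. The following is a minimal sketch of one way to include them. The function name dotProductNeonWithTail, the size_t length parameter, and the scalar path used on builds without Neon are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical variant of dotProductNeon that also handles leftover elements. */
int dotProductNeonWithTail(const short *vector1, const short *vector2, size_t len) {
    size_t i = 0;
    int result = 0;

    assert(len == 0 || (vector1 != NULL && vector2 != NULL));

#if defined(__ARM_NEON)
    /* Vectorized part: four elements per iteration. */
    int32x4_t partialSumsNeon = vdupq_n_s32(0);
    for (; i + 4 <= len; i += 4) {
        int16x4_t vector1Neon = vld1_s16(vector1 + i);
        int16x4_t vector2Neon = vld1_s16(vector2 + i);
        partialSumsNeon = vmlal_s16(partialSumsNeon, vector1Neon, vector2Neon);
    }

    /* Store the four partial sums and add them up. */
    int32_t partialSums[4];
    vst1q_s32(partialSums, partialSumsNeon);
    result = partialSums[0] + partialSums[1] + partialSums[2] + partialSums[3];
#endif

    /* Scalar tail: the remaining 0 to 3 elements on Neon builds,
       or every element when Neon is not available (assumption). */
    for (; i < len; i++) {
        result += vector1[i] * vector2[i];
    }
    return result;
}
```

Because the index i carries over from the vector loop into the scalar loop, the same function returns the full dot product for any length.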

To load data from memory, use the vld1_s16 method. This method loads four elements into the CPU registers from an array of shorts (signed 16-bit integers, or s16 for short).

When the elements are in the CPU registers, multiply and accumulate them using the vmlal_s16 (multiply and accumulate long) method. This method multiplies the corresponding elements of two register vectors and adds the products to the matching lanes of a third register vector. Each product of two 16-bit elements is widened to 32 bits, which is why the accumulator has the type int32x4_t.

The accumulator is stored in the partialSumsNeon variable. To initialize this variable, use the vdupq_n_s32 (duplicate) method, which sets all lanes of the register vector to a specified value. In this case, the value is 0. It is the vectorized equivalent of writing int sum = 0.

When all the loop iterations complete, store the resulting sums back to memory. The results can be read element by element using vget_lane methods. Alternatively, store the whole vector using vst methods. In this example, we use the second option.
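The two ways of reading the results can be sketched side by side. For a 128-bit int32x4_t vector, the lane accessor is the q variant, vgetq_lane_s32, and its lane index must be a compile-time constant. The function names sumByLane and sumByStore are hypothetical, and the scalar fallback is an assumption so that the sketch compiles without Neon.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical: read each lane directly from the register vector. */
int32_t sumByLane(const int32_t *values) {
    assert(values != NULL);
#if defined(__ARM_NEON)
    int32x4_t v = vld1q_s32(values);
    return vgetq_lane_s32(v, 0) + vgetq_lane_s32(v, 1)
         + vgetq_lane_s32(v, 2) + vgetq_lane_s32(v, 3);
#else
    return values[0] + values[1] + values[2] + values[3];  /* fallback */
#endif
}

/* Hypothetical: store the whole vector to memory, then read the array. */
int32_t sumByStore(const int32_t *values) {
    assert(values != NULL);
    int32_t stored[4];
#if defined(__ARM_NEON)
    int32x4_t v = vld1q_s32(values);
    vst1q_s32(stored, v);
#else
    for (size_t i = 0; i < 4; i++) {
        stored[i] = values[i];                              /* fallback */
    }
#endif
    return stored[0] + stored[1] + stored[2] + stored[3];
}
```

Both functions return the same value; the store-based form is the one that matches the example in this section.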

When the partial sums are back in memory, we sum them to get the final result.

On AArch64, you could also use:

return vaddvq_s32(partialSumsNeon);

This replaces both the store to partialSums and the second for loop.
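As a sketch of how the reduction might be written for both architectures: on AArch64, the across-vector add for a 128-bit int32x4_t vector is vaddvq_s32, while on 32-bit Arm a pairwise reduction with vadd_s32 and vpadd_s32 gives the same total. The function name horizontalSum and the scalar fallback for non-Neon builds are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: add the four 32-bit values at `values`. */
int32_t horizontalSum(const int32_t *values) {
    assert(values != NULL);
#if defined(__ARM_NEON) && defined(__aarch64__)
    /* AArch64: a single across-vector add. */
    return vaddvq_s32(vld1q_s32(values));
#elif defined(__ARM_NEON)
    /* 32-bit Arm: add the low and high halves, then add the pair. */
    int32x4_t v = vld1q_s32(values);
    int32x2_t pair = vadd_s32(vget_low_s32(v), vget_high_s32(v));
    pair = vpadd_s32(pair, pair);
    return vget_lane_s32(pair, 0);
#else
    /* Fallback without Neon (assumption, for portability of the sketch). */
    return values[0] + values[1] + values[2] + values[3];
#endif
}
```

In dotProductNeon, the same reduction would be applied to partialSumsNeon directly, without first storing it to memory.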
