Truncation with Neon

In this section of the guide, we look at thresholding using truncation.

When the algorithm is given an input array, it replaces all elements above the predefined threshold, T, with the value T. All elements at or below the threshold are kept unchanged.

To implement this type of algorithm, use a for loop. Each iteration calls the std::min function, which returns the smaller of the current array value and the threshold:

#include <algorithm>  // std::min
#define THRESHOLD 50
int8_t inputSignalTruncate[SIGNAL_LENGTH];
void truncate() {
    for (int i = 0; i < SIGNAL_LENGTH; i++) {
        // Values above THRESHOLD are replaced by THRESHOLD
        inputSignalTruncate[i] = std::min(inputSignal[i], (int8_t)THRESHOLD);
    }
}

In this example, the THRESHOLD macro sets the threshold to 50. The truncated signal is stored in the inputSignalTruncate global variable.
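To make the effect of std::min concrete, here is a minimal, self-contained sketch. The truncateScalar helper and its vector-based signature are assumptions made for illustration; the guide itself works on global arrays.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the guide): returns a copy of `input`
// in which every element above `threshold` is replaced by `threshold`.
std::vector<int8_t> truncateScalar(const std::vector<int8_t>& input, int8_t threshold) {
    std::vector<int8_t> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        // std::min keeps values at or below the threshold and caps larger ones.
        output[i] = std::min(input[i], threshold);
    }
    return output;
}
```

For example, truncateScalar({10, 60, -5, 50, 127}, 50) yields {10, 50, -5, 50, 50}: only the two values above 50 change.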

Because every iteration of the above algorithm is independent, you can easily apply Neon intrinsics. You must split the for loop into several segments. Each segment processes several input elements in parallel.

The number of items that you can handle in parallel depends on the input data type. In our example, the input signal is an array of int8_t elements. A 128-bit Neon register holds 16 such elements, so you can process up to 16 items per iteration (see the TRANSFER_SIZE macro in the following code).
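The relationship between element type and items per iteration can be sketched as a compile-time calculation. The lanesPerRegister helper and the kNeonRegisterBits constant are hypothetical names introduced for this illustration; the only fact assumed is that a Neon quadword register is 128 bits wide.

```cpp
#include <cstddef>
#include <cstdint>

// Width of a Neon quadword (Q) register in bits.
constexpr std::size_t kNeonRegisterBits = 128;

// Hypothetical helper: number of elements of type T that fit in one Q register.
template <typename T>
constexpr std::size_t lanesPerRegister() {
    return kNeonRegisterBits / (8 * sizeof(T));
}

// Checked at compile time; the matching Neon vector types are named in the messages.
static_assert(lanesPerRegister<int8_t>() == 16, "int8_t: 16 lanes (int8x16_t)");
static_assert(lanesPerRegister<int16_t>() == 8, "int16_t: 8 lanes (int16x8_t)");
static_assert(lanesPerRegister<float>() == 4, "float: 4 lanes (float32x4_t)");
```

Wider element types therefore give fewer items per iteration, which is why an int8_t signal is a favorable case for this technique.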

When using Neon intrinsics, we usually load data from memory to registers, process the registers with Neon SIMD, and then store the results back to memory. Here is how to follow this approach when using truncation:

#include <arm_neon.h>

#define TRANSFER_SIZE 16
void truncateNeon() {
    // Duplicate the threshold value across all 16 lanes
    int8x16_t threshValueNeon = vdupq_n_s8(THRESHOLD);

    for (int i = 0; i < SIGNAL_LENGTH; i += TRANSFER_SIZE) {
        // Load signal to registers
        int8x16_t inputNeon = vld1q_s8(inputSignal + i);

        // Truncate
        int8x16_t partialResult = vminq_s8(inputNeon, threshValueNeon);

        // Store result in the output buffer
        vst1q_s8(inputSignalTruncate + i, partialResult);
    }
}

The code follows these steps:

  1. Split the loop into SIGNAL_LENGTH / TRANSFER_SIZE segments.
  2. Duplicate the threshold value to a vector named threshValueNeon using the vdupq_n_s8 Neon function. The Arm Neon intrinsics reference lists all available Neon functions.
  3. For each segment:
    - Load a chunk of the input signal into a register using the vld1q_s8 function.
    - Calculate the element-wise minimum of the threshold vector and the chunk of the input signal.
    - Store the result in the inputSignalTruncate array using the vst1q_s8 function.

To compare the performance of the Neon and non-Neon approaches, put the truncate and truncateNeon methods in the Java-bound method:

double processingTime;

extern "C"
(JNIEnv *env, jobject thiz, jboolean useNeon) {
    auto start = now();

#if HAVE_NEON
    if (useNeon)
        truncateNeon();
    else
#endif
        truncate();

    processingTime = usElapsedTime(start);

    return nativeBufferToByteArray(env, inputSignalTruncate, SIGNAL_LENGTH);
}


Measure the processing time with the chrono library. See the companion code.  
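The companion code is not reproduced here, so the following is only a sketch of how the now and usElapsedTime helpers could be written with the chrono library. The choice of std::chrono::steady_clock and the double return type are assumptions, not taken from the companion code.

```cpp
#include <chrono>

// Assumed implementation of now(): a time stamp from a monotonic clock.
std::chrono::steady_clock::time_point now() {
    return std::chrono::steady_clock::now();
}

// Assumed implementation of usElapsedTime(): microseconds elapsed since `start`.
double usElapsedTime(std::chrono::steady_clock::time_point start) {
    std::chrono::duration<double, std::micro> elapsed = now() - start;
    return elapsed.count();
}
```

A monotonic clock is used in this sketch because it cannot jump backwards between the two readings, which keeps the measured interval non-negative.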

Store the measured execution time in the processingTime global variable. The following binding passes it to the Java code:

extern "C"
    JNIEnv *env, jobject thiz) {

    return processingTime;
}

To get the results:

  1. Run the app on your device.
  2. Tap Generate Signal. This button plots the blue curve.
  3. Tap Truncate. A green line appears.

The following screenshot shows the results for our example:

Truncation 3

Select the Use Neon checkbox and tap Truncate again. With Neon intrinsics, the processing time for our example drops from 100 microseconds to 6 microseconds, which is approximately 16 times faster. We achieved this without any significant code changes, with only slight modifications to a single loop to use Neon intrinsics.
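The loop in truncateNeon processes whole 16-element segments, so it relies on SIGNAL_LENGTH being a multiple of TRANSFER_SIZE. As a sketch of one way to lift that restriction, the hypothetical truncateWithTail function below (its name and pointer-based signature are assumptions, not part of the guide) handles full segments with Neon and finishes any remaining elements with the scalar std::min loop. The Neon path is guarded by the __ARM_NEON macro so that the code also compiles on targets without Neon.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical variant of truncateNeon() that accepts any length.
void truncateWithTail(const int8_t* input, int8_t* output,
                      std::size_t length, int8_t threshold) {
    std::size_t i = 0;

#if defined(__ARM_NEON)
    const std::size_t lanes = 16;  // int8_t elements per 128-bit register
    const int8x16_t thresholdVec = vdupq_n_s8(threshold);

    // Process only segments that fit entirely inside the buffer.
    for (; i + lanes <= length; i += lanes) {
        int8x16_t chunk = vld1q_s8(input + i);
        vst1q_s8(output + i, vminq_s8(chunk, thresholdVec));
    }
#endif

    // Scalar loop for the remaining elements (or for all of them without Neon).
    for (; i < length; ++i) {
        output[i] = std::min(input[i], threshold);
    }
}
```

With a 37-element input, for instance, the Neon path covers the first 32 elements in two iterations and the scalar loop handles the last 5.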
