Putting our code together
We can now put all of the code together. To do this, we modify the MainActivity.stringFromJNI
method as follows:
// MainActivity.stringFromJNI is shorthand: the exported JNI symbol has the
// form Java_<package>_MainActivity_stringFromJNI.
extern "C" JNIEXPORT jstring JNICALL
MainActivity.stringFromJNI (
        JNIEnv* env,
        jobject /* this */) {

    // Ramp length and number of trials
    const int rampLength = 1024;
    const int trials = 10000;

    // Generate two input vectors
    // (0, 1, ..., rampLength - 1)
    // (100, 101, ..., 100 + rampLength - 1)
    auto ramp1 = generateRamp(0, rampLength);
    auto ramp2 = generateRamp(100, rampLength);

    // Without NEON intrinsics
    // Invoke dotProduct and measure performance
    int lastResult = 0;

    auto start = now();
    for (int i = 0; i < trials; i++) {
        lastResult = dotProduct(ramp1, ramp2, rampLength);
    }
    auto elapsedTime = msElapsedTime(start);

    // With NEON intrinsics
    // Invoke dotProductNeon and measure performance
    int lastResultNeon = 0;

    start = now();
    for (int i = 0; i < trials; i++) {
        lastResultNeon = dotProductNeon(ramp1, ramp2, rampLength);
    }
    auto elapsedTimeNeon = msElapsedTime(start);

    // Clean up. Each array needs its own statement: "delete ramp1, ramp2;"
    // uses the comma operator and frees only ramp1.
    // Assumes generateRamp allocates with new[].
    delete[] ramp1;
    delete[] ramp2;

    // Display results
    std::string resultsString =
            "----==== NO NEON ====----\nResult: " + to_string(lastResult)
            + "\nElapsed time: " + to_string((int)elapsedTime) + " ms"
            + "\n\n----==== NEON ====----\n"
            + "Result: " + to_string(lastResultNeon)
            + "\nElapsed time: " + to_string((int)elapsedTimeNeon) + " ms";

    return env->NewStringUTF(resultsString.c_str());
}
The MainActivity.stringFromJNI method proceeds as follows:
- Create two equal-length vectors using the generateRamp method.
- Calculate the dot product of those vectors using the non-Neon method dotProduct. Repeat this calculation several times (the trials constant) and measure the computation time using msElapsedTime.
- Repeat the same measurement on the same vectors, but now using the Neon-enabled method dotProductNeon.
- Combine the results of both measurements in resultsString. The latter is displayed in the TextView.

To build and run the preceding code successfully, you need an Armv7-A or Armv8-A device. The following image shows the improvements that Neon intrinsics can bring to an application:
Using built-in intrinsics reduced the elapsed time by seven percent. A theoretical improvement of 25 percent could be achieved on 64-bit Arm devices.
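The benchmark relies on helpers that are not shown in this section (now, msElapsedTime, generateRamp, dotProduct). The following is a minimal, portable sketch of what they might look like, assuming std::chrono for timing and std::vector instead of raw arrays. The bodies, the BenchmarkResult struct, and the benchmarkDotProduct function are hypothetical illustrations, not the tutorial's own definitions.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for the helpers used in the benchmark above.
// The names mirror the text; the bodies are assumptions for illustration.
using Clock = std::chrono::steady_clock;

// Returns the current time on a monotonic clock.
Clock::time_point now() {
    return Clock::now();
}

// Milliseconds elapsed since `start`, as a floating-point value.
double msElapsedTime(Clock::time_point start) {
    std::chrono::duration<double, std::milli> elapsed = Clock::now() - start;
    return elapsed.count();
}

// Builds (startValue, startValue + 1, ..., startValue + length - 1).
std::vector<short> generateRamp(short startValue, short length) {
    assert(length >= 0);
    std::vector<short> ramp(static_cast<std::size_t>(length));
    for (std::size_t i = 0; i < ramp.size(); i++) {
        ramp[i] = static_cast<short>(startValue + static_cast<int>(i));
    }
    return ramp;
}

// Scalar dot product; each product is widened to int before accumulation.
int dotProduct(const std::vector<short>& a, const std::vector<short>& b) {
    assert(a.size() == b.size());
    int result = 0;
    for (std::size_t i = 0; i < a.size(); i++) {
        result += static_cast<int>(a[i]) * static_cast<int>(b[i]);
    }
    return result;
}

// Hypothetical result type: last computed value and total elapsed time.
struct BenchmarkResult {
    int lastResult;
    double elapsedMs;
};

// Runs dotProduct `trials` times on two ramps and times the whole loop.
BenchmarkResult benchmarkDotProduct(int rampLength, int trials) {
    auto ramp1 = generateRamp(0, static_cast<short>(rampLength));
    auto ramp2 = generateRamp(100, static_cast<short>(rampLength));

    int lastResult = 0;
    auto start = now();
    for (int i = 0; i < trials; i++) {
        lastResult = dotProduct(ramp1, ramp2);
    }
    return {lastResult, msElapsedTime(start)};
}
```

Calling benchmarkDotProduct(1024, 10000) mirrors the non-Neon half of the benchmark. For the two 1024-element ramps used here, the dot product is 409767424, which fits in a 32-bit int, so both the scalar and the Neon implementation should report that value. Using std::vector in this sketch avoids the manual clean-up step, at the cost of differing from the raw arrays used in the listing.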