NEON

Arm NEON technology is an advanced SIMD (single instruction multiple data) architecture extension for the Arm Cortex-A series and Cortex-R52 processors. 

NEON technology was introduced in the Armv7-A and Armv7-R profiles. It is now also an extension to the Armv8-A and Armv8-R profiles.

NEON technology is intended to improve the multimedia user experience by accelerating audio and video encoding/decoding, user interface, 2D/3D graphics or gaming. NEON can also accelerate signal processing algorithms and functions to speed up applications such as audio and video processing, voice and facial recognition, computer vision and deep learning.

Overview

NEON technology is a packed SIMD architecture. NEON registers are treated as vectors of elements of the same data type, and multiple data types are supported. The following table lists the data types supported by each architecture version.

                  Armv7-A/R             Armv8-A/R (AArch32)          Armv8-A (AArch64)
Floating-point    32-bit                16-bit*/32-bit               16-bit*/32-bit/64-bit
Integer           8-bit/16-bit/32-bit   8-bit/16-bit/32-bit/64-bit   8-bit/16-bit/32-bit/64-bit

NEON instructions perform the same operation in all lanes of a vector. The number of operations performed per instruction depends on the data type. NEON instructions allow up to:

  • 16x8-bit, 8x16-bit, 4x32-bit, 2x64-bit integer operations
  • 8x16-bit*, 4x32-bit, 2x64-bit** floating-point operations

Implementations of NEON technology can also support issuing multiple instructions in parallel.

*Only in Armv8.2-A

**Only in Armv8-A/R 
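
As a rough illustration of the lane counts listed above, the sketch below adds two 128-bit vectors first as sixteen 8-bit integer lanes and then as four 32-bit floating-point lanes. The helper names add_16x8 and add_4xf32 are hypothetical, and the scalar fallback is an assumption added only so that the code still builds on targets without NEON.

```c
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Add sixteen 8-bit unsigned integers; each lane wraps modulo 256.
   (Hypothetical helper, not part of any Arm library.) */
void add_16x8(const uint8_t a[16], const uint8_t b[16], uint8_t out[16])
{
#if HAVE_NEON
    uint8x16_t va = vld1q_u8(a);        /* load 16 lanes of 8 bits */
    uint8x16_t vb = vld1q_u8(b);
    vst1q_u8(out, vaddq_u8(va, vb));    /* one instruction, 16 additions */
#else
    for (int i = 0; i < 16; ++i)        /* scalar fallback (assumption) */
        out[i] = (uint8_t)(a[i] + b[i]);
#endif
}

/* Add four 32-bit floats. (Hypothetical helper.) */
void add_4xf32(const float a[4], const float b[4], float out[4])
{
#if HAVE_NEON
    float32x4_t va = vld1q_f32(a);      /* load 4 lanes of 32 bits */
    float32x4_t vb = vld1q_f32(b);
    vst1q_f32(out, vaddq_f32(va, vb));  /* one instruction, 4 additions */
#else
    for (int i = 0; i < 4; ++i)         /* scalar fallback (assumption) */
        out[i] = a[i] + b[i];
#endif
}
```

The same 128-bit register width is used in both functions; only the element type, and therefore the number of lanes, changes.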

How to use NEON?

NEON can be used in several ways: through NEON-enabled libraries, the compiler's auto-vectorization feature, NEON intrinsics, or hand-written NEON assembly code. Detailed information on NEON programming can be found in the NEON Programmer's Guide.

Libraries

One of the easiest ways to take advantage of NEON is to use an open source library that already makes use of NEON.

Arm Compute Library for Machine Learning and Computer Vision
The Arm Compute Library is a collection of low-level functions optimized for Arm CPU and GPU architectures targeted at image processing, computer vision, and machine learning. More information can be found at: https://developer.arm.com/technologies/compute-library

Ne10 is an open source C library, hosted on GitHub by Arm, containing a set of the most commonly used processing-intensive functions, heavily optimized for Arm. Ne10 has a modular structure consisting of several smaller libraries. Currently, these include:

Math functions
  • Vector Add
  • Matrix Add
  • Vector Subtract
  • Vector Subtract From
  • Matrix Subtract
  • Vector Multiply
  • Vector Multiply-Accumulate
  • Matrix Multiply
  • Matrix Vector Multiply
  • Vector Divide
  • Vector Set
  • Vector Length
  • Vector Normalize
  • Vector Absolute Value
  • Vector Dot Product
  • Vector Cross Product
  • Matrix Determinant
  • Matrix Inverse
  • Matrix Transpose
  • Matrix Identity

Signal processing functions
  • Floating & Fixed Point Complex-to-Complex FFT
  • Floating & Fixed Point Real-to-Complex FFT
  • FIR Filters
  • FIR Decimator
  • FIR Interpolator
  • FIR Lattice Filters
  • FIR Sparse Filters
  • IIR Lattice Filters

Image processing functions
  • Image Resize
  • Image Rotate

Physics functions
  • Collision Detection

libyuv is an open source project that includes YUV scaling and conversion functionality.

Skia is an open source 2D graphics library used as the graphics engine for Google Chrome and Chrome OS, Android, Mozilla Firefox and Firefox OS, and many other products.

Auto-vectorization

Arm compilers support auto-vectorization, in which the compiler exploits NEON functionality automatically without changes to the source code.

For details, see the NEON Programmer's Guide. The Arm Compiler User Guide also provides extra guidance on NEON optimisation.
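
A loop of the following shape is a typical candidate for auto-vectorization. This is only a sketch: the function name scale_and_add is hypothetical, and whether a given compiler actually emits NEON instructions depends on the compiler version, optimization level and target options. With GCC or Clang, for example, -O3 is a common starting point, and 32-bit Armv7 targets typically also need -mfpu=neon.

```c
#include <stddef.h>
#include <stdint.h>

/* out[i] = a[i] * k + b[i] for i in [0, n).
   The restrict qualifiers promise the compiler that the arrays do not
   overlap, which removes one common obstacle to vectorization.
   (Hypothetical example function.) */
void scale_and_add(int32_t *restrict out,
                   const int32_t *restrict a,
                   const int32_t *restrict b,
                   int32_t k, size_t n)
{
    /* Simple counted loop, independent iterations, unit-stride access. */
    for (size_t i = 0; i < n; ++i)
        out[i] = a[i] * k + b[i];
}
```

The source contains no NEON-specific code, so it remains portable; inspecting the generated assembly or the compiler's vectorization report is the usual way to check whether the loop was vectorized.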

Compiler Intrinsics

NEON intrinsics are function calls that the compiler replaces with an appropriate NEON instruction or sequence of NEON instructions. Intrinsics provide almost as much control as writing assembly language, but leave the allocation of registers to the compiler, so that developers can focus on the algorithms. The compiler can also perform instruction scheduling to remove pipeline stalls for the specified target processor. This leads to more maintainable source code than using assembly language. NEON intrinsics are supported by Arm Compiler, GCC and LLVM.

NEON intrinsic example

#include <arm_neon.h>

/* Double each of the four 32-bit lanes by adding the vector to itself. */
uint32x4_t double_elements(uint32x4_t input)
{
    return vaddq_u32(input, input);
}
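
One way the function above might be applied to an array in memory is sketched below. The helper double_array is hypothetical; it processes four elements per iteration with NEON loads and stores and finishes any remaining elements with a scalar loop. The preprocessor guard is an assumption added so that the code also builds on targets without NEON, where the scalar loop does all the work.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>

/* Same function as in the example above. */
static uint32x4_t double_elements(uint32x4_t input)
{
    return vaddq_u32(input, input);
}
#endif

/* Double every element of data[0..n) in place. (Hypothetical helper.) */
void double_array(uint32_t *data, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Vector loop: four 32-bit elements per iteration. */
    for (; i + 4 <= n; i += 4) {
        uint32x4_t v = vld1q_u32(&data[i]);       /* load 4 lanes  */
        vst1q_u32(&data[i], double_elements(v));  /* store 4 lanes */
    }
#endif
    /* Scalar loop: the leftover tail, or everything when NEON is absent. */
    for (; i < n; ++i)
        data[i] += data[i];
}
```

Handling the tail separately keeps the function correct for lengths that are not a multiple of four.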

For more information, see the Arm NEON Intrinsics Reference, which documents the NEON intrinsics for the Armv7 and Armv8 architectures.

Assembly code

For very high performance, hand-coded NEON assembly is the best approach for experienced programmers. Both the GNU assembler (gas) and the Arm Compiler toolchain assembler (armasm) support assembly of NEON instructions.
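
To give a flavour of what a NEON instruction looks like at the assembly level, the sketch below embeds a single vector ADD in a C function. It assumes GCC or Clang extended inline assembly on an AArch64 target, which is a different route from a standalone gas or armasm source file; the function name add_u32x4_asm is hypothetical, and the plain C fallback is there only so the code still builds on other targets.

```c
#include <stdint.h>

#if defined(__aarch64__) && defined(__GNUC__)
#include <arm_neon.h>
#endif

/* Add four 32-bit unsigned integers lane by lane. (Hypothetical helper.) */
void add_u32x4_asm(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
#if defined(__aarch64__) && defined(__GNUC__)
    uint32x4_t va = vld1q_u32(a);
    uint32x4_t vb = vld1q_u32(b);
    uint32x4_t vr;
    /* The "w" constraint requests a SIMD/floating-point register;
       the .4s suffix selects four lanes of 32 bits each. */
    __asm__("add %0.4s, %1.4s, %2.4s"
            : "=w"(vr)
            : "w"(va), "w"(vb));
    vst1q_u32(out, vr);
#else
    /* Fallback for other targets (assumption, not NEON assembly). */
    for (int i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
}
```

Here the compiler still chooses the registers; in a fully hand-written assembly routine the programmer also takes over register allocation and instruction scheduling.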

Tools

Arm DS-5 Development Studio provides an end-to-end suite of tools for C/C++ software development on Arm-based platforms. DS-5 fully supports the NEON architecture for both programming and debugging.

The DS-5 debugger provides full debug capabilities for NEON instructions and visualization of the NEON architectural registers. It supports all Arm architecture profiles and processors.

NEON ecosystem

A wide range of codecs and DSP modules is available from Arm's ecosystem partners.

Examples of some key available functions are detailed below:

Video codecs
  • VP9 OTT encoder, VP9 Consumer encoder/decoder
  • H.264 (AVC) encoder/decoder
  • MPEG4 SP/ASP encoder/decoder
  • MPEG2 decoder
  • H.263 decoder

Audio codecs
  • MP3 encoder/decoder
  • MPEG-2 layer I & II encoder/decoder
  • MPEG-1 layer III audio encoder
  • MPEG-1 layer III audio encoder/decoder
  • HE-AACv1, v2 encoder/decoder
  • WMA Standard encoder/decoder
  • WMA Pro, WMA Lossless decoder
  • SBC Bluetooth encoder/decoder
  • OggVorbis encoder/decoder
  • FLAC encoder/decoder
  • Dolby® Digital AC-3 encoder/decoder
  • Dolby® Digital eAC-3 decoder
  • Dolby® MS10/MS11 Multistream
  • Dolby® Digital Plus 5.1/7.1 Consumer decoder
  • Dolby® Digital 5.1 Creator Consumer encoder
  • Dolby® Pro Logic I & II encoder/decoder
  • iSAC encoder/decoder
  • CELT encoder/decoder
  • DTS core encoder/decoder
  • DAB+ encoder/decoder
  • Dolby® Mobile encoder/decoder
  • Dolby® TrueHD consumer decoder
  • Dolby® UDC encoder/decoder

Voice and speech codecs
  • G.711
  • G.722, G.722.1, G.722.2
  • G.723.1
  • G.726
  • G.727
  • G.728
  • G.729, G.729A, G.729B
  • G.729AB
  • AMR Narrowband, Wideband, Wideband+
  • GSM-HR, GSM-ER, GSM-EFR
  • Opus
  • iLBC
  • SILK
  • SPEEX
  • MELPe

Audio enhancement algorithms
  • Echo cancellation
  • Noise Reduction
  • Beam Forming
  • Comfort Noise
  • AudioZoom
  • Equalization
  • Wind noise reduction
  • Automatic Gain Control
  • Voice Activity Detection
  • Key word spotting
  • Voice trigger
  • Voice biometrics
  • Speaker verification

Computer Vision
  • Canny Edge detection
  • Harris Corner
  • ORB
  • Convolution filter
  • Erosion/Dilation
  • Face detection
  • Pedestrian detection
  • Fast9/Fast12 corner detection
  • Object tracking
  • Lane departure
  • Connected components

Machine and deep learning
  • On-device object recognition
  • On-device scene recognition
  • Human pose recognition
  • Defect detection

More details about our ecosystem partners can be found on our DSP Ecosystem Partners page.
