Learn about our research into cutting-edge Machine Learning (ML) techniques for Arm-based technologies.

Jump to section:

White papers | Research papers

White papers

Powering the edge: driving optimal performance with the Ethos-N77 processor

Repurposing a CPU, GPU, or DSP is an easy way to add ML capabilities to an edge device. However, where responsiveness or power efficiency is critical, a dedicated Neural Processing Unit (NPU) may be the best solution. In this paper, we describe how the Arm Ethos-N77 NPU delivers optimal performance.

Download

How to add intelligent vision to your next embedded product

Embedded vision enhances solutions in a broad range of markets including automotive, security, medical, and entertainment. In this paper, we explain how to add intelligent vision to your next embedded device.

Download

How to migrate intelligence from the cloud to the device

Arm and its partners are enabling a step-change increase in on-device processing, from tiny microcontrollers to multicore gateways. In this paper, we explain how Arm and its partners are approaching this unique problem.

Download

Machine learning on Arm Cortex-M microcontrollers

ML algorithms are moving to the IoT edge due to latency, power consumption, cost, network bandwidth, reliability, privacy, and security considerations. Read how the open-source CMSIS-NN library can help you maximize the performance of neural network solutions that deploy ML algorithms on low-power Arm Cortex-M cores.
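
As a rough illustration of the preparation step this implies, the Python/NumPy sketch below (the helper names and layer sizes are our own, not part of CMSIS-NN) converts a trained layer's floating-point weights into the signed 8-bit q7 fixed-point format that the library's kernels consume:

    import numpy as np

    def choose_frac_bits(weights, total_bits=8):
        # Spend as many bits as possible on the fraction while still covering the value range.
        max_abs = float(np.max(np.abs(weights)))
        int_bits = max(0, int(np.ceil(np.log2(max_abs)))) if max_abs > 0 else 0
        return total_bits - 1 - int_bits

    def quantize_to_q7(weights, frac_bits):
        # Round to signed 8-bit fixed point with `frac_bits` fractional bits and saturate.
        q = np.round(weights * (1 << frac_bits))
        return np.clip(q, -128, 127).astype(np.int8)

    # Illustrative fully connected layer: 64 inputs, 10 outputs.
    rng = np.random.default_rng(0)
    fc_weights = rng.uniform(-0.9, 0.9, size=(10, 64)).astype(np.float32)
    frac_bits = choose_frac_bits(fc_weights)
    fc_weights_q7 = quantize_to_q7(fc_weights, frac_bits)
    print(frac_bits, fc_weights_q7.min(), fc_weights_q7.max())

The kernels themselves then work directly on these int8 values, keeping activations and intermediate accumulators in fixed point as well.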

Download

Deploying always-on face unlock: integrating face identification, anti-spoofing, and low-power wakeup

Accurate face verification is a challenge due to the number of variables that are involved. In this paper, we look at a new approach that combines classic and modern machine learning techniques. This approach achieves 98.36% accuracy, runs efficiently on Arm ML-optimized platforms, and addresses key security issues.
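
While the full pipeline in the paper combines several stages, the final verification step in systems of this kind usually reduces to comparing a stored face embedding against one computed from the live capture. A minimal sketch of that comparison follows; the 128-dimensional embedding and the 0.7 threshold are illustrative choices, not values from the paper:

    import numpy as np

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def verify(enrolled, probe, threshold=0.7):
        # Accept the probe face only if it is close enough to the enrolled template.
        return cosine_similarity(enrolled, probe) >= threshold

    # Stand-in embeddings; in practice both come from a face-embedding network,
    # and anti-spoofing checks run before this comparison.
    rng = np.random.default_rng(0)
    enrolled = rng.normal(size=128)
    probe = enrolled + 0.1 * rng.normal(size=128)
    print(verify(enrolled, probe))

In a deployed system, the anti-spoofing and low-power wake-up stages described in the paper would run before this check ever executes.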

Download

Packing neural networks into end-user client devices: how number representation shrinks the footprint

Work is ongoing to simplify neural network processing so that more algorithms can run on edge devices. One approach eliminates complexity by replacing floating-point representation with fixed-point representation. In this paper, we take a different approach and recommend a mix of the two representations, to reduce memory and power requirements while retaining accuracy.
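
A toy NumPy illustration of the mixed scheme (the shapes and scaling factors are arbitrary) keeps the weights in floating point, quantizes only the activations to 8-bit fixed point, and compares the result with a full floating-point computation:

    import numpy as np

    def quantize_activations(x, frac_bits=5):
        # Signed 8-bit fixed point with `frac_bits` fractional bits, returned as the value it represents.
        q = np.clip(np.round(x * (1 << frac_bits)), -128, 127)
        return q / (1 << frac_bits)

    rng = np.random.default_rng(0)
    weights = rng.normal(0.0, 0.1, size=(64, 32)).astype(np.float32)   # kept in floating point
    activations = rng.uniform(0.0, 3.0, size=32).astype(np.float32)

    full_float = weights @ activations
    mixed = weights @ quantize_activations(activations)
    print(float(np.max(np.abs(full_float - mixed))))   # small error despite 8-bit activations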

Download

The power of speech

Voice-activated assistants that use keyword spotting have become more widespread. In this paper, we borrow from an approach used for computer vision to create a compact keyword spotting algorithm. This algorithm supports voice-driven commands in edge devices that use a very small, low-power microcontroller.

Download

The new voice of the embedded intelligent assistant

Intelligent assistance is becoming vital in our daily lives and the technology is taking a big leap forward. In this paper, Recognition Technologies and Arm provide technical insight into the architecture and design approach that is making the gateway a more powerful, efficient place for voice recognition.

Download

Research papers

Mobile machine learning hardware at Arm: a systems-on-chip perspective

2018. Cite: arXiv:1801.06274

Abstract: Machine learning is playing an increasingly significant role in emerging mobile application domains like AR, VR, and ADAS. In response, hardware architects have designed customized hardware for machine learning algorithms, especially neural networks, to improve compute efficiency. However, machine learning is typically just one processing stage in complex end-to-end applications, which involve multiple components in a mobile System-on-Chip (SoC). Focusing only on ML accelerators misses the bigger optimization opportunity at the SoC level. This paper argues that hardware architects should expand the optimization scope to the entire SoC. We demonstrate a case study in the domain of continuous computer vision, where the camera sensor, Image Signal Processor (ISP), memory, and NN accelerator are synergistically co-designed to achieve optimal system-level efficiency.

Download

Not all ops are created equal!

2018. Cite: arXiv:1801.04326

Abstract: Efficient and compact neural network models are essential for enabling deployment on mobile and embedded devices. In this paper, we point out that typical design metrics for gauging the efficiency of neural network architectures – the total number of operations and parameters – are not sufficient. These metrics may not accurately correlate with the actual deployment metrics, such as energy and memory footprint. We show that throughput and energy vary by up to 5X across different neural network operation types on an off-the-shelf Arm Cortex-M7 microcontroller. Furthermore, we show that the memory required for activation data, apart from the model parameters, also needs to be considered for network architecture exploration studies.
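
The sketch below, using illustrative layer shapes rather than figures from the paper, makes the point concrete: a convolution layer and a fully connected layer with comparable multiply-accumulate (MAC) counts can differ enormously in parameter count and in the amount of activation data that must be held in memory:

    def conv_stats(h, w, c_in, c_out, k):
        # Stride-1, same-padded convolution: parameters, output activations, and MACs.
        params = k * k * c_in * c_out + c_out
        activations = h * w * c_out
        macs = h * w * k * k * c_in * c_out
        return params, activations, macs

    def fc_stats(n_in, n_out):
        params = n_in * n_out + n_out
        return params, n_out, n_in * n_out

    # Similar MAC counts, very different parameter and activation footprints.
    print("conv 32x32, 16->16 channels, 3x3:", conv_stats(32, 32, 16, 16, 3))
    print("fully connected 2048->1024:      ", fc_stats(2048, 1024))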

Download

CMSIS-NN: efficient neural network kernels for Arm Cortex-M CPUs

2018. Cite: arXiv:1801.06601

Abstract: Deep Neural Networks are becoming increasingly popular in always-on IoT edge devices. These networks perform data analytics right at the source, reducing latency as well as energy consumption for data communication. This paper presents CMSIS-NN, efficient kernels that are developed to maximize the performance and minimize the memory footprint of Neural Network (NN) applications on Arm Cortex-M processors that are targeted for intelligent IoT edge devices. Neural network inference based on CMSIS-NN kernels achieves a 4.6X improvement in runtime/throughput and a 4.9X improvement in energy efficiency.
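
The kernels are written in C, but the general fixed-point pattern they follow can be sketched in Python/NumPy; the shift values and layer sizes below are illustrative, not taken from the library. A q7 fully connected layer widens the int8 operands into 32-bit accumulators, adds the shifted bias and a rounding term, then shifts back down and saturates to the int8 output range:

    import numpy as np

    def fc_q7(x_q7, w_q7, bias_q7, bias_shift, out_shift):
        # Widen to 32-bit accumulators, add the shifted bias and a rounding term,
        # then shift back down and saturate to the int8 output range.
        acc = (bias_q7.astype(np.int32) << bias_shift) + (1 << (out_shift - 1))
        acc = acc + w_q7.astype(np.int32) @ x_q7.astype(np.int32)
        return np.clip(acc >> out_shift, -128, 127).astype(np.int8)

    rng = np.random.default_rng(1)
    x = rng.integers(-128, 128, size=32, dtype=np.int8)
    w = rng.integers(-128, 128, size=(10, 32), dtype=np.int8)
    b = rng.integers(-128, 128, size=10, dtype=np.int8)
    print(fc_q7(x, w, b, bias_shift=3, out_shift=9))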

Download

PrivyNet: a flexible framework for privacy-preserving deep neural network training

2018. Cite: arXiv:1709.06161

Abstract: Massive amounts of data exist on users' local platforms that usually cannot support deep neural network (DNN) training due to computation and storage resource constraints. Cloud-based training schemes provide beneficial services, but suffer from potential privacy risks due to excessive user data collection. To enable cloud-based DNN training while simultaneously protecting data privacy, we propose to leverage the intermediate representations of the data, which is achieved by splitting the DNNs and deploying them separately onto local platforms and the cloud. The local neural network (NN) is used to generate the feature representations. To avoid local training and protect data privacy, the local NN is derived from pre-trained NNs. The cloud NN is then trained based on the extracted intermediate representations for the target learning task. We validate the idea of DNN splitting by characterizing the dependency of privacy loss and classification accuracy on the local NN topology for a convolutional NN (CNN) based image classification task. Based on the characterization, we further propose PrivyNet to determine the local NN topology, which optimizes the accuracy of the target learning task under the constraints on privacy loss, local computation, and storage. The efficiency and effectiveness of PrivyNet are demonstrated with the CIFAR-10 dataset.
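
A minimal tf.keras sketch of the splitting idea follows; the layer sizes, split point, and training setup are illustrative and do not reproduce PrivyNet's topology-selection procedure. A frozen local feature extractor, derived from an existing network, produces intermediate representations on the device, and only a small cloud-side model is trained on those features:

    import numpy as np
    import tensorflow as tf

    # Stand-in for a pre-trained network; in the PrivyNet setting these layers would be
    # taken from an existing model rather than trained on the user's own data.
    inputs = tf.keras.Input(shape=(32, 32, 3))
    x = tf.keras.layers.Conv2D(16, 3, activation="relu")(inputs)
    x = tf.keras.layers.MaxPooling2D()(x)
    split = tf.keras.layers.Conv2D(32, 3, activation="relu")(x)

    # Local (on-device) part: frozen feature extractor. Only its outputs leave the device.
    local_nn = tf.keras.Model(inputs, split)
    local_nn.trainable = False

    # Cloud part: trained on the intermediate representations for the target task.
    cloud_nn = tf.keras.Sequential([
        tf.keras.Input(shape=local_nn.output_shape[1:]),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    cloud_nn.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

    images = np.random.rand(8, 32, 32, 3).astype("float32")
    labels = np.random.randint(0, 10, size=8)
    features = local_nn.predict(images, verbose=0)       # what would be uploaded
    cloud_nn.fit(features, labels, epochs=1, verbose=0)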

Download

Bit fusion: bit-level dynamically composable architecture for accelerating deep neural networks

2017. Cite: arXiv:1712.01507

Download

Hello edge: keyword spotting on microcontrollers

2017. Cite: arXiv:1711.07128

Abstract: Keyword spotting (KWS) is a critical component for enabling speech-based user interactions on smart devices. It requires real-time response and high accuracy for a good user experience. Recently, neural networks have become an attractive choice for KWS architecture because of their superior accuracy compared to traditional speech processing algorithms. Due to its always-on nature, the KWS application has a highly constrained power budget and typically runs on tiny microcontrollers with limited memory and compute capability. The design of neural network architecture for KWS must consider these constraints. In this work, we perform neural network architecture evaluation and exploration for running KWS on resource-constrained microcontrollers. We train various neural network architectures for keyword spotting published in the literature to compare their accuracy and memory/compute requirements. We show that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy. We further explore the depthwise separable convolutional neural network (DS-CNN) and compare it against other neural network architectures. DS-CNN achieves an accuracy of 95.4%, which is ~10% higher than the DNN model with a similar number of parameters.
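
A compact tf.keras sketch of such a depthwise separable CNN is shown below; the input shape (49 MFCC frames of 10 coefficients), filter counts, and number of keyword classes are illustrative rather than the exact configuration evaluated in the paper:

    import tensorflow as tf

    def ds_block(x, filters):
        # Depthwise separable block: per-channel depthwise conv followed by a 1x1 pointwise conv.
        x = tf.keras.layers.DepthwiseConv2D(3, padding="same", use_bias=False)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU()(x)
        x = tf.keras.layers.Conv2D(filters, 1, use_bias=False)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        return tf.keras.layers.ReLU()(x)

    inputs = tf.keras.Input(shape=(49, 10, 1))            # MFCC features: frames x coefficients
    x = tf.keras.layers.Conv2D(64, (10, 4), strides=(2, 2), padding="same")(inputs)
    for _ in range(4):
        x = ds_block(x, 64)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    outputs = tf.keras.layers.Dense(12, activation="softmax")(x)   # keyword classes

    model = tf.keras.Model(inputs, outputs)
    model.summary()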

Download

Deep convolutional neural network inference with floating-point weights and fixed-point activations

2017. Cite: arXiv:1703.03073

Abstract: Deep convolutional neural network (CNN) inference requires a significant amount of memory and computation, which limits its deployment on embedded devices. To alleviate these problems to some extent, prior research utilizes low-precision fixed-point numbers to represent the CNN weights and activations. However, the minimum required data precision of fixed-point weights varies across different networks and also across different layers of the same network. In this work, we propose using floating-point numbers for representing the weights and fixed-point numbers for representing the activations. We show that using floating-point representation for weights is more efficient than fixed-point representation for the same bit-width and demonstrate it on popular large-scale CNNs such as AlexNet, SqueezeNet, GoogLeNet, and VGG16. We also show that such a representation scheme enables compact hardware multiply-and-accumulate (MAC) unit design. Experimental results show that the proposed scheme reduces the weight storage by up to 36% and the power consumption of the hardware multiplier by up to 50%.
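
The observation that motivates the scheme can be reproduced with a short NumPy sketch (the weight distributions and bit-widths are illustrative): when weight ranges differ from layer to layer, a single 8-bit fixed-point format leaves some layers with few fractional bits and correspondingly larger rounding error, which is the gap that a floating-point weight representation avoids:

    import numpy as np

    def fixed_point_error(weights, total_bits=8):
        # Size a signed fixed-point format to this layer's range, then measure the rounding error.
        max_abs = float(np.max(np.abs(weights)))
        int_bits = max(0, int(np.ceil(np.log2(max_abs))))
        frac_bits = total_bits - 1 - int_bits
        step = 2.0 ** -frac_bits
        q = np.clip(np.round(weights / step) * step, -2 ** int_bits, 2 ** int_bits - step)
        return frac_bits, float(np.mean(np.abs(weights - q)))

    rng = np.random.default_rng(0)
    # Layers with different weight ranges need different fixed-point formats, so no single
    # 8-bit fixed-point format suits every layer equally well.
    for name, scale in [("conv1", 0.05), ("conv2", 0.4), ("fc", 2.5)]:
        w = rng.normal(0.0, scale, size=1000)
        print(name, fixed_point_error(w))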

Download