Learn about our research into developing cutting-edge Machine Learning (ML) techniques on Arm-based technologies.
White papers
Repurposing a CPU, GPU, or DSP is an easy way to add ML capabilities to an edge device. However, where responsiveness or power efficiency is critical, a dedicated Neural Processing Unit (NPU) may be the best solution. In this paper, we describe how the Arm Ethos-N77 NPU delivers optimal performance.
Download

ML algorithms are moving to the IoT edge due to latency, power consumption, cost, network bandwidth, reliability, privacy, and security considerations. Read how the open-source CMSIS-NN library can help you maximize the performance of neural network solutions that deploy ML algorithms on low-power Arm Cortex-M cores.

Download

Embedded vision enhances solutions in a broad range of markets, including automotive, security, medical, and entertainment. In this paper, we explain how to add intelligent vision to your next embedded device.

Download

Arm and its partners are enabling a step-change increase in on-device processing, from tiny microcontrollers to multicore gateways. In this paper, we explain how Arm and its partners are approaching this unique problem.

Download

Accurate face verification is a challenge due to the number of variables involved. In this paper, we look at a new approach that combines classic and modern machine learning techniques. This approach achieves 98.36% accuracy, runs efficiently on Arm ML-optimized platforms, and addresses key security issues.

Download

Work is ongoing to simplify neural network processing so that more algorithms can run on edge devices. One approach eliminates complexity by replacing floating-point representation with fixed-point representation. In this paper, we take a different approach and recommend a mix of the two representations, reducing memory and power requirements while retaining accuracy.

Download

Voice-activated assistants that use keyword spotting have become more widespread. In this paper, we borrow from an approach used for computer vision to create a compact keyword-spotting algorithm. This algorithm supports voice-driven commands in edge devices that use a very small, low-power microcontroller.

Download

Intelligent assistance is becoming vital in our daily lives, and the technology is taking a big leap forward. In this paper, Recognition Technologies and Arm provide technical insight into the architecture and design approach that is making the gateway a more powerful, efficient place for voice recognition.
Download

2018. Cite: arXiv:1801.06274
Abstract: Machine learning is playing an increasingly significant role in emerging mobile application domains like AR, VR, and ADAS. As a result, hardware architects have designed customized hardware for machine learning algorithms, especially neural networks, to improve compute efficiency. However, machine learning is typically just one processing stage in complex end-to-end applications, which involve multiple components in a mobile System-on-Chip (SoC). Focusing only on ML accelerators misses the bigger optimization opportunity at the SoC level. This paper argues that hardware architects should expand the optimization scope to the entire SoC. We demonstrate a case study in the domain of continuous computer vision, where the camera sensor, Image Signal Processor (ISP), memory, and NN accelerator are synergistically co-designed to achieve optimal system-level efficiency.
Download

2018. Cite: arXiv:1801.04326
Abstract: Efficient and compact neural network models are essential for enabling deployment on mobile and embedded devices. In this paper, we point out that typical design metrics for gauging the efficiency of neural network architectures – the total number of operations and parameters – are not sufficient. These metrics may not accurately correlate with the actual deployment metrics, such as energy and memory footprint. We show that throughput and energy vary by up to 5X across different neural network operation types on an off-the-shelf Arm Cortex-M7 microcontroller. Furthermore, we show that the memory required for activation data, apart from the model parameters, also needs to be considered in network architecture exploration studies.
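The point the abstract makes can be illustrated with a rough cost model: operation and parameter counts are one budget, but activation memory is a separate one that layer shapes can inflate independently. The sketch below is illustrative only; the layer dimensions are hypothetical and the model ignores implementation details such as data layout and im2col buffers.

```python
def conv2d_costs(h, w, c_in, c_out, k, stride=1):
    """Return (MACs, parameter count, activation elements) for a k x k
    2-D convolution over an h x w x c_in input (same padding assumed)."""
    h_out, w_out = h // stride, w // stride
    macs = h_out * w_out * c_out * c_in * k * k          # compute cost
    params = c_out * c_in * k * k + c_out                # weights + biases
    activations = h * w * c_in + h_out * w_out * c_out   # input + output maps
    return macs, params, activations

macs, params, acts = conv2d_costs(32, 32, 16, 32, 3)
print(macs, params, acts)
```

Note how the activation count (input plus output feature maps) dwarfs the parameter count for this shape: a model selected purely on parameters could still exceed a microcontroller's SRAM.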
Download

2018. Cite: arXiv:1801.06601
Abstract: Deep Neural Networks are becoming increasingly popular in always-on IoT edge devices. These networks perform data analytics right at the source, reducing latency as well as energy consumption for data communication. This paper presents CMSIS-NN, efficient kernels that are developed to maximize the performance and minimize the memory footprint of Neural Network (NN) applications on Arm Cortex-M processors that are targeted for intelligent IoT edge devices. Neural network inference based on CMSIS-NN kernels achieves 4.6X improvement in runtime/throughput and 4.9X improvement in energy efficiency.
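To make the kind of arithmetic these kernels perform concrete, here is a plain-Python model of an 8-bit (q7) fixed-point fully-connected computation: products accumulate in a wide accumulator, then a right-shift requantizes the result back to 8 bits with saturation. This is a simplified sketch of the idea, not the CMSIS-NN API; the input values and shift amount are arbitrary examples.

```python
def saturate_q7(x):
    """Clamp an integer to the signed 8-bit range [-128, 127]."""
    return max(-128, min(127, x))

def fully_connected_q7(inputs, weights, bias, out_shift):
    """One output neuron: int8 x int8 MACs gathered in a wide integer
    accumulator, then scaled back to a q7 output by a right shift."""
    acc = bias
    for x, w in zip(inputs, weights):
        acc += x * w                       # products exceed 8 bits
    return saturate_q7(acc >> out_shift)   # requantize to 8-bit output

print(fully_connected_q7([10, -20, 30], [50, 60, 70], 128, 7))
```

Keeping the accumulator wide and deferring the shift to the end is what preserves accuracy while letting inputs, weights, and outputs all stay 8-bit in memory.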
Download

2017. Cite: arXiv:1712.01507
Abstract: Efficient and compact neural network models are essential for enabling deployment on mobile and embedded devices. In this work, we point out that typical design metrics for gauging the efficiency of neural network architectures – the total number of operations and parameters – are not sufficient. These metrics may not accurately correlate with the actual deployment metrics, such as energy and memory footprint. We show that throughput and energy vary by up to 5X across different neural network operation types on an off-the-shelf Arm Cortex-M7 microcontroller. Furthermore, we show that the memory required for activation data, apart from the model parameters, also needs to be considered in network architecture exploration studies.
Download

2017. Cite: arXiv:1703.03073
Abstract: Deep convolutional neural network (CNN) inference requires a significant amount of memory and computation, which limits its deployment on embedded devices. To alleviate these problems to some extent, prior research utilizes low-precision fixed-point numbers to represent the CNN weights and activations. However, the minimum required data precision of fixed-point weights varies across different networks and also across different layers of the same network. In this work, we propose using floating-point numbers to represent the weights and fixed-point numbers to represent the activations. We show that using floating-point representation for weights is more efficient than fixed-point representation for the same bit-width, and demonstrate this on popular large-scale CNNs such as AlexNet, SqueezeNet, GoogLeNet, and VGG16. We also show that such a representation scheme enables a compact hardware multiply-and-accumulate (MAC) unit design. Experimental results show that the proposed scheme reduces weight storage by up to 36% and the power consumption of the hardware multiplier by up to 50%.
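A minimal sketch of the mixed representation the paper proposes: activations are quantized to signed fixed-point codes while weights stay floating point, and the accumulator is rescaled back to real units at the end. The bit-width, fractional bits, and example values below are illustrative assumptions, not figures from the paper.

```python
def quantize_fixed(x, frac_bits=7, bits=8):
    """Quantize a real value to a signed fixed-point integer code with
    `frac_bits` fractional bits, saturating to the representable range."""
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(lo, min(hi, round(x * (1 << frac_bits))))

def mac_mixed(activations, weights, frac_bits=7):
    """Multiply fixed-point activation codes by float weights, then
    rescale the accumulator back to real units."""
    acc = sum(quantize_fixed(a, frac_bits) * w
              for a, w in zip(activations, weights))
    return acc / (1 << frac_bits)

print(mac_mixed([0.5, -0.25, 0.125], [1.5, 0.5, -2.0]))
```

Because only the activations carry a shared scale factor, the float weights need no per-layer precision tuning, which is the asymmetry the paper exploits.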
Download

2018. Cite: arXiv:1709.06161
Abstract: Massive amounts of data exist on user-local platforms that usually cannot support deep neural network (DNN) training due to computation and storage resource constraints. Cloud-based training schemes provide beneficial services but suffer from potential privacy risks due to excessive user data collection. To enable cloud-based DNN training while simultaneously protecting data privacy, we propose leveraging intermediate representations of the data, achieved by splitting the DNNs and deploying them separately onto local platforms and the cloud. The local neural network (NN) is used to generate the feature representations. To avoid local training and protect data privacy, the local NN is derived from pre-trained NNs. The cloud NN is then trained on the extracted intermediate representations for the target learning task. We validate the idea of DNN splitting by characterizing the dependency of privacy loss and classification accuracy on the local NN topology for a convolutional NN (CNN) based image-classification task. Based on this characterization, we further propose PrivyNet to determine the local NN topology, which optimizes the accuracy of the target learning task under constraints on privacy loss, local computation, and storage. The efficiency and effectiveness of PrivyNet are demonstrated with the CIFAR-10 dataset.
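The splitting idea can be sketched as a simple partition of a pre-trained layer stack: the first few layers run on the device as a frozen feature extractor, and only their output features cross to the cloud, where the remaining layers are trained. The layer names and split point below are hypothetical; how PrivyNet actually chooses the split is the optimization the paper describes.

```python
def split_network(layers, split_at):
    """Partition a layer list into a frozen, locally deployed feature
    extractor and a cloud-side portion trained on extracted features."""
    local = layers[:split_at]   # on-device; derived from a pre-trained NN
    cloud = layers[split_at:]   # trained in the cloud, sees features only
    return local, cloud

pretrained_layers = ["conv1", "conv2", "conv3", "fc4", "fc5"]
local_nn, cloud_nn = split_network(pretrained_layers, 2)
print(local_nn, cloud_nn)
```

Moving the split deeper generally lowers privacy loss (the features are more abstract) but shifts compute and storage onto the constrained device, which is the trade-off the paper characterizes.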
Download

2017. Cite: arXiv:1711.07128
Abstract: Keyword spotting (KWS) is a critical component for enabling speech-based user interactions on smart devices. It requires real-time response and high accuracy for a good user experience. Recently, neural networks have become an attractive choice for KWS architectures because of their superior accuracy compared to traditional speech-processing algorithms. Due to its always-on nature, a KWS application has a highly constrained power budget and typically runs on tiny microcontrollers with limited memory and compute capability. The design of neural network architectures for KWS must consider these constraints. In this work, we perform neural network architecture evaluation and exploration for running KWS on resource-constrained microcontrollers. We train various neural network architectures for keyword spotting published in the literature to compare their accuracy and memory/compute requirements. We show that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy. We further explore the depthwise separable convolutional neural network (DS-CNN) and compare it against other neural network architectures. DS-CNN achieves an accuracy of 95.4%, which is ~10% higher than a DNN model with a similar number of parameters.
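The parameter saving behind the DS-CNN choice can be shown with simple counting: a depthwise separable convolution replaces one k x k standard convolution with a k x k depthwise step plus a 1 x 1 pointwise step. The channel counts below are illustrative, not taken from the paper's models.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (biases ignored)."""
    return c_in * c_out * k * k

def ds_conv_params(c_in, c_out, k):
    """Weights in the depthwise-separable equivalent."""
    depthwise = c_in * k * k    # one k x k filter per input channel
    pointwise = c_in * c_out    # 1 x 1 conv mixing channels
    return depthwise + pointwise

std = standard_conv_params(64, 64, 3)
ds = ds_conv_params(64, 64, 3)
print(std, ds, round(std / ds, 2))
```

For 3 x 3 kernels the saving approaches 9x as channel counts grow, which is what lets a DS-CNN spend its parameter budget on depth and accuracy within a microcontroller's memory limits.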
DownloadAnswered | Forum FAQs | 0 votes | 226 views | 0 replies | Started 1 months ago by Annie Cracknell | Answer this |
Answered | How to make Ethos-U NPU work on an ARM Cortex-A + Cortex-M processor? | 0 votes | 8523 views | 35 replies | Latest yesterday by alisonw | Answer this |
Suggested answer | Image Processing | 0 votes | 898 views | 1 replies | Latest 2 days ago by RCameron | Answer this |
Suggested answer | Pytorch framework for Arm NN (CMSIS) | 0 votes | 5250 views | 7 replies | Latest 1 months ago by Arman Gupta | Answer this |