White papers

To learn more about machine learning on Arm, see our range of available white papers:

  • How to add intelligent vision to your next embedded product


    Embedded vision has been proven to enhance solutions in a broad range of markets including automotive, security, medical and entertainment. Read this white paper to learn how to add intelligent vision to your next embedded device.

    Read more

  • How to migrate intelligence from the cloud to the device

    Understand how Arm and its partners are enabling a step-change increase in on-device processing, whether on tiny microcontrollers or multicore gateways.

    Read more

  • Machine Learning on Arm Cortex-M Microcontrollers

    Machine learning (ML) algorithms are moving to the IoT edge due to considerations such as latency, power consumption, cost, network bandwidth, reliability, privacy and security. Hence, there is increasing interest in developing neural network (NN) solutions that can be deployed on low-power edge devices such as Arm Cortex-M microcontroller systems. CMSIS-NN is an open-source library of optimized software kernels that maximize NN performance on Cortex-M cores with minimal memory footprint overhead. A minimal usage sketch in C follows this list.

    Download paper
  • Deploying Always-on Face Unlock: Integrating Face Identification, Anti-Spoofing, and Low-Power Wakeup


    Accurate face verification has long been considered a challenge due to the number of variables, ranging from lighting to pose and facial expression. This white paper looks at a new approach – combining classic and modern machine learning (deep learning) techniques – that achieves 98.36% accuracy, runs efficiently on Arm ML-optimized platforms, and addresses key security issues such as multi-user verification and anti-spoofing.

    Read more
  • Packing Neural Networks into End-User Client Devices: How Number Representation Shrinks the Footprint

    Most of today’s neural networks can only run on high-performance servers. There is a big push to change this and simplify network processing to the point where the algorithms can run on end-user client devices. One approach is to eliminate complexity by replacing floating-point representation with fixed-point representation. We take a different approach and recommend a mix of the two, reducing memory and power requirements while retaining accuracy. A small illustrative sketch of this mixed representation follows this list.

    Read more
  • The Power of Speech

    Supporting Voice-Driven Commands in Small, Low-Power Microcontrollers. 
    Borrowing from an approach used for computer vision, we created a compact keyword spotting algorithm that supports voice-driven commands in edge devices that use a very small, low-power microcontroller. 

    Read more

  • The New Voice of the Embedded Intelligent Assistant

    As intelligent assistants become a vital part of our daily lives, the technology behind them is taking a big leap forward. Recognition Technologies and Arm have published a white paper that provides technical insight into the architecture and design approach that is making the gateway a more powerful, efficient place for voice recognition.

    Read more
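
The CMSIS-NN library described in the "Machine Learning on Arm Cortex-M Microcontrollers" paper above exposes fixed-point (q7/q15) kernels that an application calls layer by layer. The fragment below is a minimal sketch of that usage pattern, one fully connected layer followed by ReLU and softmax, written against the q7 kernels in arm_nnfunctions.h. The layer sizes, weight arrays and bias/output shift values are hypothetical placeholders; in a real application they come from your offline training and quantization flow.

    /* Minimal sketch: fully connected layer + ReLU + softmax with CMSIS-NN q7 kernels.
       All sizes and quantization shifts below are illustrative placeholders. */
    #include "arm_nnfunctions.h"

    #define IN_DIM  128   /* input feature vector length (example) */
    #define OUT_DIM 10    /* number of output classes (example)    */

    /* Quantized (q7) weights and biases exported from an offline training flow. */
    static const q7_t fc_weights[OUT_DIM * IN_DIM] = { 0 /* trained, quantized values */ };
    static const q7_t fc_bias[OUT_DIM]             = { 0 /* trained, quantized values */ };

    void classify(const q7_t input[IN_DIM], q7_t output[OUT_DIM])
    {
        static q15_t scratch[IN_DIM];   /* working buffer required by the kernel */
        q7_t fc_out[OUT_DIM];

        /* y = W*x + b in 8-bit fixed point; bias_shift/out_shift (1 and 7 here)
           are per-layer scaling factors chosen during quantization. */
        arm_fully_connected_q7(input, fc_weights, IN_DIM, OUT_DIM,
                               1, 7, fc_bias, fc_out, scratch);

        arm_relu_q7(fc_out, OUT_DIM);             /* in-place ReLU           */
        arm_softmax_q7(fc_out, OUT_DIM, output);  /* normalized class scores */
    }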

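As a companion to the "Packing Neural Networks into End-User Client Devices" paper above (and the related research paper on floating-point weights with fixed-point activations listed below), here is a small illustrative sketch of the mixed-representation idea: activations stored as 8-bit fixed-point values while weights stay in floating point. The Q0.7 format and the helper names are assumptions made for this example, not taken from the paper.

    /* Illustrative sketch: 8-bit fixed-point (Q0.7) activations combined with
       floating-point weights. The Q-format and bit-width are assumptions. */
    #include <stdint.h>
    #include <stddef.h>

    #define ACT_FRAC_BITS 7   /* Q0.7: activations represented in [-1, 1) */

    /* Quantize a float activation to Q0.7 with saturation. */
    static inline int8_t quantize_act(float a)
    {
        float scaled = a * (float)(1 << ACT_FRAC_BITS);
        if (scaled >  127.0f) scaled =  127.0f;
        if (scaled < -128.0f) scaled = -128.0f;
        return (int8_t)scaled;
    }

    /* Dot product of quantized activations against float weights: activations
       take 4x less memory than 32-bit floats, while weights keep full precision. */
    static float mixed_dot(const int8_t *act, const float *w, size_t n)
    {
        float acc = 0.0f;
        for (size_t i = 0; i < n; i++)
            acc += ((float)act[i] / (float)(1 << ACT_FRAC_BITS)) * w[i];
        return acc;
    }
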
Research papers

Learn about our research into cutting-edge machine learning techniques on Arm-based technologies:

  • Mobile Machine Learning Hardware at Arm: A Systems-on-Chip (SoC) Perspective

    2018. Cite: arXiv:1801.06274

    Abstract: Machine learning is playing an increasingly significant role in emerging mobile application domains such as AR/VR, ADAS, etc. Accordingly, hardware architects have designed customized hardware for machine learning algorithms, especially neural networks, to improve compute efficiency. However, machine learning is typically just one processing stage in complex end-to-end applications, which involve multiple components in a mobile system-on-chip (SoC). Focusing on just the ML accelerators misses the bigger optimization opportunity at the system (SoC) level. This paper argues that hardware architects should expand the optimization scope to the entire SoC. We demonstrate one particular case study in the domain of continuous computer vision, where the camera sensor, image signal processor (ISP), memory, and NN accelerator are synergistically co-designed to achieve optimal system-level efficiency.

    Download

  • Not All Ops Are Created Equal!

    2018. Cite: arXiv:1801.04326

    Abstract: Efficient and compact neural network models are essential for enabling deployment on mobile and embedded devices. In this work, we point out that the typical design metrics for gauging the efficiency of neural network architectures – total number of operations and parameters – are not sufficient. These metrics may not accurately correlate with actual deployment metrics such as energy and memory footprint. We show that throughput and energy vary by up to 5X across different neural network operation types on an off-the-shelf Arm Cortex-M7 microcontroller. Furthermore, we show that the memory required for activation data also needs to be considered, apart from the model parameters, for network architecture exploration studies.
    (A per-operation cycle-measurement sketch follows this list.)

    Download
  • CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs

    2018. Cite: arXiv:1801.06601

    Abstract: Deep Neural Networks are becoming increasingly popular in always-on IoT edge devices performing data analytics right at the source, reducing latency as well as energy consumption for data communication. This paper presents CMSIS-NN, efficient kernels developed to maximize the performance and minimize the memory footprint of neural network (NN) applications on Arm Cortex-M processors targeted for intelligent IoT edge devices. Neural network inference based on CMSIS-NN kernels achieves 4.6X improvement in runtime/throughput and 4.9X improvement in energy efficiency.

    Download
  • PrivyNet: A Flexible Framework for Privacy-Preserving Deep Neural Network Training

    2018. Cite: arXiv:1709.06161

    Abstract: Massive amounts of data exist on users' local platforms, which usually cannot support deep neural network (DNN) training due to computation and storage resource constraints. Cloud-based training schemes provide beneficial services, but suffer from potential privacy risks due to excessive user data collection. To enable cloud-based DNN training while simultaneously protecting data privacy, we propose to leverage intermediate representations of the data, which is achieved by splitting the DNN and deploying it separately onto local platforms and the cloud. The local neural network (NN) is used to generate the feature representations. To avoid local training and protect data privacy, the local NN is derived from pre-trained NNs. The cloud NN is then trained on the extracted intermediate representations for the target learning task. We validate the idea of DNN splitting by characterizing the dependency of privacy loss and classification accuracy on the local NN topology for a convolutional NN (CNN) based image classification task. Based on this characterization, we further propose PrivyNet to determine the local NN topology, which optimizes the accuracy of the target learning task under constraints on privacy loss, local computation, and storage. The efficiency and effectiveness of PrivyNet are demonstrated on the CIFAR-10 dataset.

    Download
  • Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks

    2017. Cite: arXiv:1712.01507

    Download
  • Hello Edge: Keyword Spotting on Microcontrollers

    2017. Cite: arXiv:1711.07128

    Abstract: Keyword spotting (KWS) is a critical component for enabling speech-based user interactions on smart devices. It requires real-time response and high accuracy for a good user experience. Recently, neural networks have become an attractive choice for KWS architectures because of their superior accuracy compared to traditional speech processing algorithms. Due to its always-on nature, a KWS application has a highly constrained power budget and typically runs on tiny microcontrollers with limited memory and compute capability. The design of neural network architectures for KWS must consider these constraints. In this work, we perform neural network architecture evaluation and exploration for running KWS on resource-constrained microcontrollers. We train various neural network architectures for keyword spotting published in the literature to compare their accuracy and memory/compute requirements. We show that it is possible to optimize these neural network architectures to fit within the memory and compute constraints of microcontrollers without sacrificing accuracy. We further explore the depthwise separable convolutional neural network (DS-CNN) and compare it against other neural network architectures. DS-CNN achieves an accuracy of 95.4%, which is ~10% higher than a DNN model with a similar number of parameters.
    (A sketch of the depthwise separable convolution block follows this list.)

    Download
  • Deep Convolutional Neural Network Inference with Floating-point Weights and Fixed-point Activations

    2017. Cite: arXiv:1703.03073

    Abstract: Deep convolutional neural network (CNN) inference requires a significant amount of memory and computation, which limits its deployment on embedded devices. To alleviate these problems to some extent, prior research utilizes low-precision fixed-point numbers to represent the CNN weights and activations. However, the minimum required data precision of fixed-point weights varies across different networks and also across different layers of the same network. In this work, we propose using floating-point numbers for representing the weights and fixed-point numbers for representing the activations. We show that using floating-point representation for weights is more efficient than fixed-point representation for the same bit-width, and demonstrate this on popular large-scale CNNs such as AlexNet, SqueezeNet, GoogLeNet and VGG16. We also show that such a representation scheme enables a compact hardware multiply-and-accumulate (MAC) unit design. Experimental results show that the proposed scheme reduces weight storage by up to 36% and the power consumption of the hardware multiplier by up to 50%.

    Download
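
To make the kind of per-operation measurement described in "Not All Ops Are Created Equal!" above concrete, the sketch below reads the DWT cycle counter available on Cortex-M7 class cores through CMSIS-Core around two candidate workloads. The "device.h" header name and the run_conv_layer/run_fc_layer workloads are hypothetical stand-ins; energy comparisons additionally require an external power monitor.

    /* Sketch: per-operation cycle counts on a Cortex-M7 using the DWT cycle
       counter exposed by CMSIS-Core. "device.h" stands in for your vendor's
       CMSIS device header; some parts also require unlocking DWT->LAR first. */
    #include "device.h"
    #include <stdint.h>

    static void cyccnt_init(void)
    {
        CoreDebug->DEMCR |= CoreDebug_DEMCR_TRCENA_Msk;  /* enable trace/DWT block */
        DWT->CYCCNT = 0;
        DWT->CTRL  |= DWT_CTRL_CYCCNTENA_Msk;            /* start cycle counter    */
    }

    static uint32_t measure_cycles(void (*op)(void))
    {
        uint32_t start = DWT->CYCCNT;
        op();
        return DWT->CYCCNT - start;   /* valid while op() runs < 2^32 cycles */
    }

    extern void run_conv_layer(void);   /* hypothetical workloads under test */
    extern void run_fc_layer(void);

    void profile_ops(void)
    {
        cyccnt_init();
        uint32_t conv_cycles = measure_cycles(run_conv_layer);
        uint32_t fc_cycles   = measure_cycles(run_fc_layer);
        /* Compare measured cycles per operation type instead of relying on
           parameter or MAC counts alone. */
        (void)conv_cycles;
        (void)fc_cycles;
    }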

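The DS-CNN explored in "Hello Edge" above is built from depthwise separable convolutions. The plain-C sketch below shows the two stages of that block, a per-channel depthwise convolution followed by a 1x1 pointwise convolution, using float arithmetic, channel-last (HWC) layout, no padding and stride 1 purely for clarity; it illustrates the structure and is not the paper's implementation.

    /* Sketch of a depthwise separable convolution block: a per-channel KxK
       depthwise convolution followed by a 1x1 pointwise convolution.
       Float arithmetic, HWC layout, 'valid' padding and stride 1 for clarity. */

    /* Depthwise: each input channel is filtered independently with its own KxK kernel. */
    static void depthwise_conv(const float *in, float *out,
                               const float *dw_kernel,      /* [C][K][K] */
                               int H, int W, int C, int K)
    {
        int out_h = H - K + 1, out_w = W - K + 1;            /* 'valid' output size */
        for (int c = 0; c < C; c++)
            for (int y = 0; y < out_h; y++)
                for (int x = 0; x < out_w; x++) {
                    float acc = 0.0f;
                    for (int ky = 0; ky < K; ky++)
                        for (int kx = 0; kx < K; kx++)
                            acc += in[((y + ky) * W + (x + kx)) * C + c] *
                                   dw_kernel[(c * K + ky) * K + kx];
                    out[(y * out_w + x) * C + c] = acc;
                }
    }

    /* Pointwise: a 1x1 convolution mixes the C input channels into M output channels. */
    static void pointwise_conv(const float *in, float *out,
                               const float *pw_kernel,      /* [M][C] */
                               int H, int W, int C, int M)
    {
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                for (int m = 0; m < M; m++) {
                    float acc = 0.0f;
                    for (int c = 0; c < C; c++)
                        acc += in[(y * W + x) * C + c] * pw_kernel[m * C + c];
                    out[(y * W + x) * M + m] = acc;
                }
    }
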
Developer material

Arm is creating the tools you need to bring your best solutions to market. Find out more below.

View

How-to guides

Learn to develop machine learning applications using Arm-based products and tools with our how-to guides.

View

Webinars

Discover tips and techniques for your Arm-based machine learning projects with our growing bank of webinars.

View