Arm Machine Learning Processor 

Industry-leading performance and efficiency for inference at the edge.

Based on a new, class-leading architecture, the Arm ML processor's optimized design enables new features, enhances user experience, and delivers innovative applications for a wide array of market segments including mobile, IoT, embedded, automotive, and infrastructure. It provides a significant uplift in efficiency compared to CPUs, GPUs, and DSPs through efficient convolution (Winograd support), exploitation of sparsity, and compression of weights and activations.

Download the datasheet

Key Features

  • Designed to provide outstanding performance for mobile, delivering 4 TOP/s with an efficiency of 5 TOP/W; additional optimizations provide a further uplift in real-world use cases.
  • Programmable layer engines for future-proofing.
  • Incorporates a variety of compression technologies to minimize system memory bandwidth (see the sketch after this list).
  • Highly tuned for implementation on advanced process geometries.
  • Supports secure operating mode to protect DNN IP and data.
  • Low-latency response improves the user experience.
  • Supports Arm TrustZone system security, providing a secure operating mode, configurable secure queues for multiple users, and flexible processing in the TEE or SEE for secure use cases such as biometric payment and protection of high-value media streams.
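
Arm does not publish the details of these compression technologies, but the underlying idea is straightforward: after pruning, many weights are exactly zero, so runs of zeros can be stored compactly and skipped during compute. The Python sketch below shows a hypothetical zero run-length encoder for int8 weights; it is purely illustrative and is not the scheme implemented in the hardware.

```python
import numpy as np

def rle_compress_zeros(weights: np.ndarray):
    """Encode a flat int8 weight array as (zero_run_length, value) pairs.

    Illustrative only: the ML processor's actual compression format is not
    published; this just shows why sparse weights compress well.
    """
    encoded = []
    zero_run = 0
    for w in weights.flatten():
        if w == 0:
            zero_run += 1
        else:
            encoded.append((zero_run, int(w)))  # skip `zero_run` zeros, then emit w
            zero_run = 0
    if zero_run:
        encoded.append((zero_run, 0))  # trailing zeros
    return encoded

# Example: a roughly 70%-sparse int8 weight tensor, as pruning might produce.
rng = np.random.default_rng(0)
weights = rng.integers(-128, 128, size=1024, dtype=np.int8)
weights[rng.random(1024) < 0.7] = 0

encoded = rle_compress_zeros(weights)
# Assume 2 bytes per (run, value) pair for this toy format.
print(f"raw: {weights.nbytes} bytes, encoded: {2 * len(encoded)} bytes")
```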
 
 

Key Benefits

  • Enables ML processing at the edge, saving power, reducing data consumption, and enhancing user privacy.
  • Flexible design supports a variety of popular neural networks, including CNNs and RNNs, for classification, object detection, image enhancements, speech recognition and natural language understanding.
  • Winograd support accelerates common convolution filters (such as 3x3) by 2.25x compared to other NPUs, delivering more performance in less area (see the worked example after the block diagram).
  • Reduces system memory bandwidth by 1.5-3x through a variety of compression technologies targeting both weights and activation feature maps.
  • Tight system integration through an ACE-Lite master port, with optional SMMU integration, provides memory protection and easy handling of multiple users.
  • The Arm Machine Learning processor is compatible with Arm NN, an inference engine for CPUs, GPUs and NPUs that bridges the gap between existing NN frameworks and the underlying IP.

[Image: Machine Learning Processor block diagram]
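
The 2.25x figure is consistent with the standard arithmetic for Winograd convolution. Assuming the common F(2x2, 3x3) tiling (Arm does not state which variant the hardware uses), each 2x2 output tile of a 3x3 convolution needs 16 multiplications instead of the 36 required by direct convolution:

```python
# Multiplications per 2x2 output tile of a 3x3 convolution.
direct_muls = (2 * 2) * (3 * 3)    # 4 outputs x 9 multiply-accumulates each = 36
winograd_muls = (2 + 3 - 1) ** 2   # F(2x2, 3x3) works on a 4x4 transformed tile = 16

print(direct_muls / winograd_muls)  # 2.25 -> the quoted uplift on 3x3 filters
```

In practice the input, filter, and output transforms add extra additions, so the realized gain depends on the layer shapes being run.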

Find out more

Find out more about the Arm Machine Learning processor.

Contact us

To learn more about Machine Learning on Arm, visit our ML Developer community.



Specifications

Key Features
  • Performance (at 1 GHz): 4 TOP/s
  • Data Types: Int-8 and Int-16
  • Network Support: CNN and RNN
  • Efficient Convolution: Winograd support
  • Sparsity: Yes
  • Secure Mode: TEE or SEE
  • Multi-Core Capability: 8 NPUs in a cluster; 64 NPUs in a mesh

Memory System
  • Embedded SRAM: 1 MB
  • Bandwidth Reduction: Extended compression technology, layer/operator fusion
  • Main Interface: 1x AXI4 (128-bit), ACE5-Lite

Development Platform
  • Neural Frameworks: TensorFlow, TensorFlow Lite, Caffe2, PyTorch, MXNet, ONNX
  • Neural Operator APIs: Arm NN, Android NN
  • Software Components: Arm NN, neural compiler, driver and support library
  • Debug and Profile: Layer-by-layer visibility
  • Evaluation and Early Prototyping: Arm Juno FPGA systems and cycle models
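
As a concrete illustration of the development platform above, the sketch below runs an int8-quantized model with the TensorFlow Lite Python interpreter, one of the listed frameworks. The model filename and input handling are placeholder assumptions; on a device containing the ML processor, dispatching the workload to the NPU would go through the platform's Arm NN and driver integration (for example a delegate), not anything shown here.

```python
import numpy as np
import tensorflow as tf

# Placeholder path to an int8-quantized .tflite model (hypothetical).
interpreter = tf.lite.Interpreter(model_path="model_int8.tflite")
interpreter.allocate_tensors()

inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]

# Quantize a float input into the integer domain the model expects.
scale, zero_point = inp["quantization"]
image = np.random.rand(*inp["shape"][1:]).astype(np.float32)  # dummy input
info = np.iinfo(inp["dtype"])
q_image = np.clip(np.round(image / scale + zero_point),
                  info.min, info.max).astype(inp["dtype"])

interpreter.set_tensor(inp["index"], q_image[np.newaxis, ...])
interpreter.invoke()
scores = interpreter.get_tensor(out["index"])
print(scores.shape)
```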

 


Applications

  • Mobile
  • AR/VR
  • IoT
  • Smart camera
  • Healthcare
  • Medical
  • Logistics
  • STB/DTV
  • Robotics
  • Home
  • Consumer
  • Drones
  • Infrastructure

Get Support

Arm Support

Arm training courses and on-site system-design advisory services enable licensees to integrate the Arm ML processor into their designs efficiently, realizing maximum system performance with the lowest risk and fastest time-to-market.

  • Arm training courses
  • Arm Design Reviews
  • Open a support case
