Powering innovation in a new world of AI devices

Build low-cost, highly efficient AI solutions for a wide range of embedded devices with the latest addition to the Arm Ethos-U microNPU family. The Ethos-U65 maintains the power efficiency of the Arm Ethos-U55 while extending its applicability to Arm Cortex-A, Cortex-R, and Arm Neoverse-based systems, and delivers twice the on-device machine learning performance.

Block diagram: Cortex-A-based system with Ethos-U65

Block diagram: Cortex-M-based system with Ethos-U65

Key features
Performance (at 1 GHz): 512 GOP/s to 1 TOP/s
MACs (8x8): 256 or 512
Utilization on popular networks: up to 85%
Data types: Int-8 and Int-16
Network support: CNN and RNN/LSTM
Winograd support: no
Sparsity: yes

Memory system
Internal SRAM: 55 to 104 KB
System interfaces: two 128-bit AXI
External on-chip SRAM: KB to multi-MB
Compression: weights only
Memory optimizations: extended compression, layer and operator fusion

Development platform
Neural frameworks: TensorFlow Lite Micro
Operating systems: bare-metal, RTOS, Linux
Software components: TensorFlow Lite Micro runtime, CMSIS-NN, optimizer, driver
Debug and profile: layer-by-layer visibility with PMUs
Evaluation and early prototyping: Performance Model, Cycle Accurate Model, or FPGA evaluation

Key features

Extending performance and efficiency
Unlock new vision and voice use cases in minimal area with a 2x performance uplift compared to the Ethos-U55 processor. Reach 1 TOP/s in 0.6 mm² (in 16 nm).
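The headline throughput figures follow directly from the MAC count and the clock: each MAC performs one multiply and one add (two operations) per cycle. A quick sanity check of the specification-table numbers, assuming the 256- and 512-MAC configurations at 1 GHz:

```python
# Peak NPU throughput: MACs x 2 ops per MAC per cycle x clock frequency.
def peak_ops_per_second(macs: int, freq_hz: float) -> float:
    return macs * 2 * freq_hz

# Ethos-U65 configurations from the specification table, at 1 GHz.
for macs in (256, 512):
    tops = peak_ops_per_second(macs, 1e9) / 1e12
    print(f"{macs} MACs -> {tops:.3f} TOP/s")
# 256 MACs -> 0.512 TOP/s (512 GOP/s); 512 MACs -> 1.024 TOP/s (~1 TOP/s)
```

These are theoretical peaks; the "up to 85% utilization" figure above is what bounds the achievable fraction on real networks.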

Flexible integration
Build low-cost, highly efficient systems with rich-OS and DRAM support on Cortex-A and Neoverse systems, or bare-metal and RTOS SRAM/Flash systems on Cortex-M, using the highly successful Ethos-U architecture.

Unified software and tools
Develop, deploy, and debug AI applications with the Arm Endpoint AI solution using a common toolchain across Arm Cortex, Neoverse, and Ethos-U processors.

Enhanced design

  • Supports popular networks with extended operator support
  • Provides wider AXI interfaces
  • Improves reliability with ECC on internal RAMs

Key benefits

New use cases
Enables demanding AI use cases like object detection and segmentation with 150% higher performance (inferences/s), supported by read and write access to DRAM.

Support complex models
Process complex workloads under a rich OS on Cortex-A and Neoverse systems, with wider 128-bit AXI interfaces and DRAM support delivering an average 150% improvement in inferences/s on popular networks.

Integrated DMA
Weights and activations are fetched ahead of time using a DMA connected to system memory through an AXI5 master interface.
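The prefetching pattern amounts to double-buffering: while the compute engine consumes the current tile of weights and activations, the DMA fetches the next one, hiding memory latency. A minimal Python sketch of the scheduling idea only; the `fetch`/`compute` callables are stand-ins, not the real Ethos-U driver API:

```python
# Illustrative double-buffering: while compute consumes `current`, the DMA
# fills `next_buf` with the following tile, hiding fetch latency.
# This models the scheduling idea only, not Arm's actual driver interface.
def process(tiles, fetch, compute):
    outputs = []
    next_buf = fetch(tiles[0])                 # DMA: prefetch the first tile
    for i, _ in enumerate(tiles):
        current = next_buf
        if i + 1 < len(tiles):
            next_buf = fetch(tiles[i + 1])     # DMA: fetch ahead of compute
        outputs.append(compute(current))       # NPU: MACs run on current tile
    return outputs

result = process([[1, 2], [3, 4], [5, 6]], fetch=list, compute=sum)
print(result)  # [3, 7, 11]
```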

Provides up to 90% energy reduction for ML workloads such as automatic speech recognition (ASR), compared to previous Cortex-M generations.

Future-proof operator coverage
Heavy compute operators run directly on the microNPU, including:

  • Convolution
  • LSTM
  • RNN
  • Pooling
  • Activation functions
  • Primitive element-wise functions

Other kernels run automatically on the tightly coupled Cortex-M using CMSIS-NN.
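The placement decision described above can be pictured as a simple graph partition: operators the microNPU supports are delegated to it, and everything else falls back to the CPU. The operator names and supported set below are hypothetical, chosen only to illustrate the split:

```python
# Illustrative operator partitioning: NPU-capable operators (per the list
# above) run on the Ethos-U; the rest fall back to the Cortex CPU, where
# CMSIS-NN kernels would handle them. The supported set is hypothetical.
NPU_OPS = {"CONV_2D", "LSTM", "RNN", "MAX_POOL_2D", "RELU", "ADD", "MUL"}

def partition(graph):
    return [(op, "ethos-u" if op in NPU_OPS else "cpu") for op in graph]

print(partition(["CONV_2D", "RELU", "SOFTMAX", "ADD"]))
# SOFTMAX is outside the (hypothetical) supported set, so it runs on the CPU
```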

Offline optimization
Offline compilation and optimization of neural networks performs operator and layer fusion, as well as layer reordering, to increase performance and reduce system memory requirements by up to 90%. This delivers higher performance and lower power than non-optimized ordering.
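One classic example of the kind of layer fusion an offline optimizer performs is folding a batch-normalization layer into the preceding convolution's weight and bias, so only one layer executes at inference time. A scalar sketch using the standard batch-norm formula; this illustrates the principle, not the actual Ethos-U optimizer implementation:

```python
import math

# Fold batch norm (gamma, beta, mean, var) into a conv's weight and bias:
#   y = gamma * (w*x + b - mean) / sqrt(var + eps) + beta
#     = (gamma/s) * w * x + (gamma/s) * (b - mean) + beta,  s = sqrt(var + eps)
def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

w2, b2 = fold_bn(w=0.5, b=0.1, gamma=2.0, beta=0.3, mean=0.2, var=1.0)
x = 4.0
fused = w2 * x + b2                                        # one fused layer
unfused = 2.0 * (0.5 * x + 0.1 - 0.2) / math.sqrt(1.0 + 1e-5) + 0.3
print(abs(fused - unfused) < 1e-9)  # True: fusion preserves the result
```

The fused network computes the same values with one fewer layer, fewer memory round trips, and no separate batch-norm parameters to store.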

Element-wise engine
Optimized for commonly used element-wise operations such as addition, multiplication, and subtraction, which underpin common scaling, LSTM, and GRU operations. Also enables future operators composed of these primitives.

Mixed precision

  • Lower-precision Int-8 for classification and detection tasks
  • Higher-precision Int-16 for audio and limited HDR image enhancement
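The precision difference is easy to see numerically: over the same value range, Int-16 has a step size 256x finer than Int-8, so its quantization error is correspondingly smaller. A simple sketch, quantizing one value at both widths over an assumed range of [-1, 1):

```python
# Quantization error at Int-8 vs Int-16, over an assumed range of [-1, 1).
def quantize(x, bits):
    scale = 1.0 / (2 ** (bits - 1))                    # step size
    q = round(x / scale)
    q = max(-(2 ** (bits - 1)), min(2 ** (bits - 1) - 1, q))
    return q * scale                                   # dequantized value

x = 0.123456
for bits in (8, 16):
    err = abs(quantize(x, bits) - x)
    print(f"Int-{bits}: error {err:.6f}")
# Int-16's smaller step yields far lower error, which is why it suits audio
# and HDR-style signals with wide dynamic range.
```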

Lossless compression
Advanced, lossless model compression reduces model size by up to 75%, increasing system inference performance and reducing power.
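The idea can be illustrated with a general-purpose lossless codec: quantized, sparsity-pruned weights are highly redundant, so they compress well and decompress back bit-exactly. Here zlib stands in for Arm's actual weight-compression scheme, which is a different, hardware-decodable format:

```python
import zlib

# Illustrative only: zlib stands in for Arm's weight-compression scheme.
# Pruned (sparse) int8 weights contain long runs of zeros, so a lossless
# codec shrinks them substantially while restoring them exactly.
weights = bytes([0, 0, 0, 7, 0, 0, 0, 0, 12, 0] * 100)   # ~75% zeros, 1000 B
packed = zlib.compress(weights, level=9)
print(len(weights), "->", len(packed), "bytes")
assert zlib.decompress(packed) == weights                # lossless round trip
```

Smaller weight streams mean fewer bytes fetched per inference, which is where the system-level performance and power gains come from.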