Powering innovation in a new world of AI devices

Build low-cost, highly efficient AI solutions for a wide range of embedded devices with the latest addition to the Arm Ethos-U microNPU family. The Ethos-U65 maintains the power efficiency of the Arm Ethos-U55 while extending its applicability to Arm Cortex-A, Cortex-R, and Arm Neoverse-based systems, and delivers twice the on-device machine learning performance.

Block diagram: Cortex-A-based system with Ethos-U65

Block diagram: Cortex-M-based system with Ethos-U65

Key features
Performance (at 1 GHz): 512 GOP/s to 1 TOP/s
MACs (8x8): 256 or 512
Utilization on popular networks: up to 85%
Data types: Int-8 and Int-16
Network support: CNN and RNN/LSTM
Winograd support: no
Sparsity: yes

Memory system
Internal SRAM: 55 to 104 KB
System interfaces: two 128-bit AXI
External on-chip SRAM: KB to multi-MB
Compression: weights only
Memory optimizations: extended compression, layer and operator fusion

Development platform
Neural frameworks: TensorFlow Lite Micro
Operating systems: bare-metal, RTOS, Linux
Software components: TensorFlow Lite Micro runtime, CMSIS-NN, optimizer, driver
Debug and profile: layer-by-layer visibility with PMUs
Evaluation and early prototyping: Performance Model, Cycle Accurate Model, or FPGA evaluation

Key features

Extending performance and efficiency
Unlock new vision and voice use cases in minimal area with a 2x performance uplift compared to the Ethos-U55 processor. Reach 1 TOP/s in 0.6 mm² (in 16 nm).
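The headline throughput figures follow directly from the MAC count and the clock: each MAC performs one multiply and one add (two operations) per cycle. A quick sanity check of the specification-table numbers, assuming the 256- and 512-MAC configurations at 1 GHz:

```python
# Peak NPU throughput: MACs x 2 ops per MAC per cycle x clock frequency.
def peak_ops_per_second(macs: int, freq_hz: float) -> float:
    return macs * 2 * freq_hz

# Ethos-U65 configurations from the specification table, at 1 GHz.
for macs in (256, 512):
    tops = peak_ops_per_second(macs, 1e9) / 1e12
    print(f"{macs} MACs -> {tops:.3f} TOP/s")
# 256 MACs -> 0.512 TOP/s (512 GOP/s); 512 MACs -> 1.024 TOP/s (~1 TOP/s)
```

These are theoretical peaks; the "up to 85% utilization" figure above is what bounds the achievable fraction on real networks.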

Flexible integration
Build low-cost, highly efficient systems with rich-OS and DRAM support on Cortex-A and Neoverse systems, or bare-metal and RTOS SRAM/Flash systems on Cortex-M, using the highly successful Ethos-U architecture.

Unified software and tools
Develop, deploy, and debug AI applications with the Arm Endpoint AI solution using a common toolchain across Arm Cortex, Neoverse, and Ethos-U processors.

Enhanced design

  • Supports popular networks with extended operator support
  • Provides wider AXI interfaces
  • Improves reliability with ECC on internal RAMs

Key benefits

New use cases
Enables demanding AI use cases like object detection and segmentation with 150% higher performance (inferences/s), supported by read and write access to DRAM.

Support complex models
Process complex workloads under a rich OS on Cortex-A and Neoverse systems, with wider 128-bit AXI interfaces and DRAM support delivering an average 150% improvement in inferences/s on popular networks.

Integrated DMA
Weights and activations are fetched ahead of time using a DMA connected to system memory through an AXI5 master interface.
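The prefetching pattern amounts to double-buffering: while the compute engine consumes the current tile of weights and activations, the DMA fetches the next one, hiding memory latency. A minimal Python sketch of the scheduling idea only; the `fetch`/`compute` callables are stand-ins, not the real Ethos-U driver API:

```python
# Illustrative double-buffering: while compute consumes `current`, the DMA
# fills `next_buf` with the following tile, hiding fetch latency.
# This models the scheduling idea only, not Arm's actual driver interface.
def process(tiles, fetch, compute):
    outputs = []
    next_buf = fetch(tiles[0])                 # DMA: prefetch the first tile
    for i, _ in enumerate(tiles):
        current = next_buf
        if i + 1 < len(tiles):
            next_buf = fetch(tiles[i + 1])     # DMA: fetch ahead of compute
        outputs.append(compute(current))       # NPU: MACs run on current tile
    return outputs

result = process([[1, 2], [3, 4], [5, 6]], fetch=list, compute=sum)
print(result)  # [3, 7, 11]
```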

Provides up to 90% energy reduction for ML workloads such as automatic speech recognition (ASR), compared to previous Cortex-M generations.

Future-proof operator coverage
Heavy compute operators run directly on the microNPU, including:

  • Convolution
  • LSTM
  • RNN
  • Pooling
  • Activation functions
  • Primitive element-wise functions

Other kernels run automatically on the tightly coupled Cortex-M using CMSIS-NN.
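The placement decision described above can be pictured as a simple graph partition: operators the microNPU supports are delegated to it, and everything else falls back to the CPU. The operator names and supported set below are hypothetical, chosen only to illustrate the split:

```python
# Illustrative operator partitioning: NPU-capable operators (per the list
# above) run on the Ethos-U; the rest fall back to the Cortex CPU, where
# CMSIS-NN kernels would handle them. The supported set is hypothetical.
NPU_OPS = {"CONV_2D", "LSTM", "RNN", "MAX_POOL_2D", "RELU", "ADD", "MUL"}

def partition(graph):
    return [(op, "ethos-u" if op in NPU_OPS else "cpu") for op in graph]

print(partition(["CONV_2D", "RELU", "SOFTMAX", "ADD"]))
# SOFTMAX is outside the (hypothetical) supported set, so it runs on the CPU
```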

Offline optimization
Offline compilation and optimization of neural networks performs operator and layer fusion, as well as layer reordering, to increase performance and reduce system memory requirements by up to 90%. This delivers higher performance and lower power than non-optimized ordering.
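One classic example of the kind of layer fusion an offline optimizer performs is folding a batch-normalization layer into the preceding convolution's weight and bias, so only one layer executes at inference time. A scalar sketch using the standard batch-norm formula; this illustrates the principle, not the actual Ethos-U optimizer implementation:

```python
import math

# Fold batch norm (gamma, beta, mean, var) into a conv's weight and bias:
#   y = gamma * (w*x + b - mean) / sqrt(var + eps) + beta
#     = (gamma/s) * w * x + (gamma/s) * (b - mean) + beta,  s = sqrt(var + eps)
def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

w2, b2 = fold_bn(w=0.5, b=0.1, gamma=2.0, beta=0.3, mean=0.2, var=1.0)
x = 4.0
fused = w2 * x + b2                                        # one fused layer
unfused = 2.0 * (0.5 * x + 0.1 - 0.2) / math.sqrt(1.0 + 1e-5) + 0.3
print(abs(fused - unfused) < 1e-9)  # True: fusion preserves the result
```

The fused network computes the same values with one fewer layer, fewer memory round trips, and no separate batch-norm parameters to store.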

Element-wise engine
Optimized for commonly used element-wise operations such as addition, multiplication, and subtraction, which underpin common scaling, LSTM, and GRU operations. Also enables future operators composed of these primitives.

Mixed precision

  • Lower-precision Int-8 for classification and detection tasks
  • Higher-precision Int-16 for audio and limited HDR image enhancement
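The precision difference is easy to see numerically: over the same value range, Int-16 has a step size 256x finer than Int-8, so its quantization error is correspondingly smaller. A simple sketch, quantizing one value at both widths over an assumed range of [-1, 1):

```python
# Quantization error at Int-8 vs Int-16, over an assumed range of [-1, 1).
def quantize(x, bits):
    scale = 1.0 / (2 ** (bits - 1))                    # step size
    q = round(x / scale)
    q = max(-(2 ** (bits - 1)), min(2 ** (bits - 1) - 1, q))
    return q * scale                                   # dequantized value

x = 0.123456
for bits in (8, 16):
    err = abs(quantize(x, bits) - x)
    print(f"Int-{bits}: error {err:.6f}")
# Int-16's smaller step yields far lower error, which is why it suits audio
# and HDR-style signals with wide dynamic range.
```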

Lossless compression
Advanced, lossless model compression reduces model size by up to 75%, increasing system inference performance and reducing power.
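The idea can be illustrated with a general-purpose lossless codec: quantized, sparsity-pruned weights are highly redundant, so they compress well and decompress back bit-exactly. Here zlib stands in for Arm's actual weight-compression scheme, which is a different, hardware-decodable format:

```python
import zlib

# Illustrative only: zlib stands in for Arm's weight-compression scheme.
# Pruned (sparse) int8 weights contain long runs of zeros, so a lossless
# codec shrinks them substantially while restoring them exactly.
weights = bytes([0, 0, 0, 7, 0, 0, 0, 0, 12, 0] * 100)   # ~75% zeros, 1000 B
packed = zlib.compress(weights, level=9)
print(len(weights), "->", len(packed), "bytes")
assert zlib.decompress(packed) == weights                # lossless round trip
```

Smaller weight streams mean fewer bytes fetched per inference, which is where the system-level performance and power gains come from.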