Powering innovation in a new world of AI devices
Build low-cost, highly efficient AI solutions for a wide range of embedded devices with the latest addition to the Arm Ethos-U microNPU family. The Ethos-U65 maintains the power efficiency of the Arm Ethos-U55 while extending its applicability to Arm Cortex-A, Cortex-R, and Arm Neoverse-based systems, and delivers twice the on-device machine learning performance.
| Category | Feature | Details |
| --- | --- | --- |
| Key features | Performance (at 1 GHz) | 512 GOPS to 1 TOPS |
| | MACs (8x8) | 256, 512 |
| | Utilization on popular networks | Up to 85% |
| | Data types | Int-8 and Int-16 |
| | Network support | CNN and RNN/LSTM |
| Memory system | Internal SRAM | 55 to 104 KB |
| | External interfaces | Two 128-bit AXI |
| | External on-chip SRAM | KB to multi-MB |
| | Memory optimizations | Extended compression, layer and operator fusion |
| Development platform | Neural frameworks | TensorFlow Lite Micro |
| | Operating systems | Bare-metal, RTOS, Linux |
| | Software components | TensorFlow Lite Micro runtime, CMSIS-NN, optimizer, driver |
| | Debug and profile | Layer-by-layer visibility with PMUs |
| | Evaluation and early prototyping | Performance model, cycle-accurate model, or FPGA evaluation |
Extending performance and efficiency
Unlock new vision and voice use cases in minimal area, with a 2x performance uplift over the Ethos-U55 processor. Reach 1 TOPS in 0.6 mm² (in 16 nm).
Build low-cost, highly efficient systems with rich-OS and DRAM support on Cortex-A and Neoverse systems, and bare-metal or RTOS SRAM/flash systems on Cortex-M, using the highly successful Ethos-U architecture.
Unified software and tools
Develop, deploy, and debug AI applications with the Arm Endpoint AI solution using a common toolchain across Arm Cortex, Neoverse, and Ethos-U processors.
- Supports popular networks with extended operator support
- Provides wider AXI interfaces
- Improves reliability with ECC added into internal RAMs
New use cases
Enables demanding AI use cases such as object detection and segmentation, with 150% higher performance (inferences per second) and support for reading from and writing to DRAM.
Support complex models
Process complex workloads under a rich OS on Cortex-A and Neoverse systems, with wider 128-bit AXI interfaces and DRAM support delivering an average 150% improvement in inferences per second on popular networks.
Weights and activations are fetched ahead of time using a DMA connected to system memory through an AXI5 master interface.
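The benefit of fetching weights ahead of time can be sketched with a simple timing model: once the first fetch is done, each subsequent DMA transfer overlaps with the compute of the previous layer, so per-layer cost approaches the larger of fetch and compute rather than their sum. This is a schematic sketch of the double-buffering idea, not the Ethos-U DMA programming model; all names and costs below are illustrative.

```python
# Schematic cost model of DMA weight prefetch. Costs are abstract
# time units; this is an illustration of pipelining, not hardware.

def serial_time(n_layers, fetch, compute):
    """No prefetch: every layer pays fetch + compute back to back."""
    return n_layers * (fetch + compute)

def prefetch_time(n_layers, fetch, compute):
    """With prefetch: only the first fetch is exposed; later fetches
    are hidden behind the previous layer's compute."""
    return fetch + (n_layers - 1) * max(fetch, compute) + compute

# Example: 4 layers, fetch=2, compute=3 time units each.
assert serial_time(4, 2, 3) == 20
assert prefetch_time(4, 2, 3) == 14  # fetches 2-4 hidden behind compute
```

When compute time per layer exceeds fetch time, the DMA traffic is fully hidden and only the first transfer adds latency.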
Provides up to 90% energy reduction for ML workloads such as automatic speech recognition (ASR), compared to previous Cortex-M generations.
Future-proof operator coverage
Heavy compute operators run directly on the microNPU, such as:
- Activation functions
- Primitive element-wise functions
Other kernels run automatically on the tightly coupled Cortex-M using CMSIS-NN.
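This split amounts to a placement decision per operator: supported operators go to the microNPU and the rest fall back to CMSIS-NN kernels on the CPU. The sketch below illustrates the idea only; the operator set and function names are assumptions, not the real Ethos-U driver or compiler API.

```python
# Illustrative NPU/CPU operator partitioning. The supported-operator
# set and names here are hypothetical examples, not the actual
# Ethos-U operator coverage list.

NPU_SUPPORTED = {
    "CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED",
    "ADD", "MUL", "SUB", "LOGISTIC", "TANH", "RELU",
}

def place_operators(graph):
    """Assign each operator to the microNPU or the CPU fallback."""
    return {op: ("ethos-u" if op in NPU_SUPPORTED else "cmsis-nn")
            for op in graph}

model = ["CONV_2D", "RELU", "CUSTOM_POSTPROCESS"]
placement = place_operators(model)
assert placement["CONV_2D"] == "ethos-u"
assert placement["CUSTOM_POSTPROCESS"] == "cmsis-nn"
```

Because the partitioning happens offline, the runtime simply dispatches each subgraph to the unit it was compiled for.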
Offline compilation and optimization of neural networks performs operator and layer fusion, as well as layer reordering, to increase performance and reduce system memory requirements by up to 90%, delivering higher performance and lower power than non-optimized ordering.
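As a minimal illustration of operator fusion (a toy model of the idea, not the optimizer's actual algorithm), two consecutive element-wise operations can be collapsed into a single pass over the data, eliminating the intermediate tensor and its memory traffic:

```python
# Toy operator fusion: a Multiply followed by an Add becomes one
# fused multiply-add pass, so the intermediate tensor is never
# written to memory. Illustrative only.

def multiply(xs, scale):
    return [x * scale for x in xs]       # materializes an intermediate

def add(xs, offset):
    return [x + offset for x in xs]      # second full pass over data

def fused_mul_add(xs, scale, offset):
    return [x * scale + offset for x in xs]  # one pass, no intermediate

data = [1.0, 2.0, 3.0]
assert add(multiply(data, 2.0), 0.5) == fused_mul_add(data, 2.0, 0.5)
```

The fused form reads and writes each tensor once instead of twice, which is where the memory-requirement and power savings come from.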
Designed to optimize commonly used element-wise operations such as addition, multiplication, and subtraction, which underpin common scaling, LSTM, and GRU operations. This also enables future operators composed of the same primitive operations.
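For example, an LSTM cell-state update can be expressed entirely in terms of such element-wise primitives (sigmoid, tanh, multiply, add). The sketch below is a mathematical illustration of that composition, not the device's actual execution:

```python
import math

# Element-wise primitives of the kind described above.
def sigmoid(xs): return [1.0 / (1.0 + math.exp(-x)) for x in xs]
def tanh(xs):    return [math.tanh(x) for x in xs]
def mul(a, b):   return [x * y for x, y in zip(a, b)]
def add(a, b):   return [x + y for x, y in zip(a, b)]

def lstm_cell_state(f_pre, i_pre, g_pre, c_prev):
    """New cell state c_t = sigmoid(f) * c_prev + sigmoid(i) * tanh(g),
    built purely from element-wise primitive operations."""
    return add(mul(sigmoid(f_pre), c_prev),
               mul(sigmoid(i_pre), tanh(g_pre)))

# sigmoid(0) = 0.5, so c_t = 0.5 * 0.5 + 0.5 * tanh(1.0)
c_t = lstm_cell_state([0.0], [0.0], [1.0], [0.5])
```

A GRU update decomposes the same way, which is why accelerating these few primitives covers a broad family of recurrent operators.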
- Int-8 and Int-16 lower precision for classification and detection tasks
- High-precision Int-16 for audio and limited HDR image enhancements
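These integer data types rely on affine quantization: a real value is mapped to an integer through a scale and zero point, and recovered approximately on the way back. A minimal sketch follows; the scale and zero-point values are arbitrary examples, not values from any particular model.

```python
# Minimal affine quantization sketch: q = round(x / scale) + zero_point,
# clamped to the signed integer range for the chosen bit width.

def quantize(x, scale, zero_point, bits=8):
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # e.g. -128..127
    q = round(x / scale) + zero_point
    return max(qmin, min(qmax, q))

def dequantize(q, scale, zero_point):
    """Recover an approximation of the original real value."""
    return (q - zero_point) * scale

scale, zp = 0.05, 0
q = quantize(1.0, scale, zp)                       # Int-8 value 20
assert abs(dequantize(q, scale, zp) - 1.0) < scale  # within one step
```

Int-16 simply widens the integer range (`bits=16`), shrinking the rounding error, which is why it suits audio and image-enhancement workloads that need higher precision.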
Advanced, lossless model compression reduces model size by up to 75%, increasing system inference performance and reducing power.
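Lossless weight compression works because trained (and especially pruned) networks contain many zero weights. A toy zero-run-length encoding (not Arm's actual compression format) makes the effect concrete:

```python
# Toy lossless compression: runs of zero weights collapse into
# ('Z', count) tokens. Illustrative only; the real Ethos-U
# compression scheme is different and more sophisticated.

def rle_zeros(weights):
    encoded, i = [], 0
    while i < len(weights):
        if weights[i] == 0:
            j = i
            while j < len(weights) and weights[j] == 0:
                j += 1
            encoded.append(("Z", j - i))  # one token for the whole run
            i = j
        else:
            encoded.append(weights[i])
            i += 1
    return encoded

def decode(encoded):
    out = []
    for tok in encoded:
        if isinstance(tok, tuple):
            out.extend([0] * tok[1])
        else:
            out.append(tok)
    return out

w = [3, 0, 0, 0, 0, 7, 0, 0, 5]
enc = rle_zeros(w)
assert decode(enc) == w   # lossless round trip
assert len(enc) < len(w)  # fewer tokens than raw weights
```

Because decompression is exact, accuracy is unchanged; the smaller footprint reduces both memory traffic and the power spent moving weights.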