Specifications

Ethos-U55 is a new class of machine learning (ML) processor, called a microNPU, or micro neural processing unit, designed to accelerate ML inference in area-constrained embedded and IoT devices. When paired with the Cortex-M55 processor, it provides a 480x uplift in ML performance over previous-generation Cortex-M processors.

The Arm Ethos-U55 enables powerful embedded ML inference, with partner-configurable options that allow a fast time to market.

Key Features
  • Performance (at 1 GHz): 64 to 512 GOP/s
  • MACs (8x8): 32, 64, 128, or 256
  • Utilization on popular networks: up to 85%
  • Data types: Int-8 and Int-16
  • Network support: CNN and RNN/LSTM
  • Winograd support: No
  • Sparsity: Yes

Memory System
  • Internal SRAM: 18 to 50 KB
  • External on-chip SRAM: KB to multi-MB
  • Compression: weights only
  • Memory optimizations: extended compression, layer/operator fusion

Development Platform
  • Neural frameworks: TensorFlow Lite Micro
  • Operating systems: RTOS or bare metal
  • Software components: TensorFlow Lite Micro runtime, CMSIS-NN, optimizer, driver
  • Debug and profile: layer-by-layer visibility with PMUs
  • Evaluation and early prototyping: Performance Model, Cycle Accurate Model, or FPGA evaluation
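
As a quick sanity check on the numbers above: each MAC unit performs one multiply and one accumulate (two operations) per cycle, so at 1 GHz the smallest 32-MAC configuration delivers 32 × 2 = 64 GOP/s and the largest 256-MAC configuration delivers 256 × 2 = 512 GOP/s, matching the stated performance range.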

Key features
  • Partner Configurable: Multiple configurations allow designers to rapidly target a wide variety of AI applications with up to a 480x increase in performance.
  • Extremely Small Area: Ethos-U55 delivers up to a 90% energy reduction in approximately 0.1 mm², enabling AI applications in cost-sensitive and energy-constrained devices.
  • Single Toolchain: A unified toolchain for Ethos-U55 and Cortex-M simplifies the development of AI applications.
  • Future Proof: Native support for the most common ML network operations, including CNNs and RNNs, while also allowing for future ML innovations.

Key benefits
  • Provides up to a 90% energy reduction for ML workloads, such as automatic speech recognition (ASR), compared to previous Cortex-M generations.
  • Flexible design supports various popular neural networks, including CNNs and RNNs, for audio processing, speech recognition, image classification, and object detection.
  • Heavy compute operators such as convolution, LSTM, RNN, pooling, activation functions, and primitive element-wise functions run directly on the microNPU. Kernels that are not composed of these operators run automatically on the tightly coupled Cortex-M using CMSIS-NN (see the sketch after this list).
  • Up to 85% utilization of MAC engines on popular networks.
  • Offline compilation and optimization of neural networks, performing operator and layer fusion as well as layer reordering, increases performance and reduces system memory requirements by up to 90%, delivering higher performance and lower power than a non-optimized ordering.
  • Weights and activations are fetched ahead of time by a DMA connected to system memory via an AXI5 master interface.
  • Optimized for commonly used element-wise operations such as add, mul, and sub, which underpin scaling, LSTM, and GRU operations; this also enables future operators composed of the same primitives.
  • Supports Int-8 and Int-16 data types: lower-precision Int-8 for classification and detection tasks, and higher-precision Int-16 for audio and limited HDR image enhancement.
  • Advanced, lossless model compression reduces model size by up to 75%, increasing system inference performance and reducing power.
  • Works seamlessly with Cortex-M4, Cortex-M7, Cortex-M33, and Cortex-M55 processors, and can be integrated via an expansion interface built into the Corstone-300 reference design. This allows designers to configure and build high-performance, power-efficient SoCs while further differentiating with Arm processors and their own IP elements, integrated via industry-standard interfaces.
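
To make the microNPU/CPU split concrete, here is a minimal sketch of an inference loop built on the TensorFlow Lite Micro runtime and driver listed under Software Components, assuming a model already processed by the offline optimizer (Arm's Vela compiler), which rewrites NPU-supported subgraphs into a single Ethos-U custom operator. The symbol g_model_data, the arena size, and the Softmax fallback are illustrative placeholders, and the MicroInterpreter constructor signature varies across TFLM versions.

```cpp
#include <cstddef>
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Vela-compiled model baked into flash (placeholder symbol name).
extern const unsigned char g_model_data[];

// Scratch memory for activations and the NPU command stream; the required
// size is model-dependent (placeholder value here).
constexpr size_t kArenaSize = 64 * 1024;
alignas(16) static uint8_t tensor_arena[kArenaSize];

int RunInference() {
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Register the Ethos-U custom operator plus any CPU fallback kernels.
  // Operators the offline optimizer could not map to the microNPU remain
  // ordinary TFLM kernels and run on the Cortex-M via CMSIS-NN.
  static tflite::MicroMutableOpResolver<2> resolver;
  resolver.AddEthosU();   // hands Vela's command stream to the NPU driver
  resolver.AddSoftmax();  // example CPU-side fallback operator

  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk) return -1;

  // ... fill interpreter.input(0)->data.int8 with quantized input data ...

  if (interpreter.Invoke() != kTfLiteOk) return -1;

  // ... read quantized results from interpreter.output(0) ...
  return 0;
}
```

Because the Ethos-U kernel is registered like any other operator, the application code is identical whether a given layer executes on the microNPU or falls back to CMSIS-NN on the Cortex-M.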