
Because modern AI chipsets adopt different strategies for executing operations efficiently, most neural network models may not be sufficiently optimized for a given device in terms of latency and memory footprint.

In this talk, we present how we efficiently deploy popular neural models on Ethos-U65, a newly launched micro NPU. To this end, we first examine various operation forms (e.g., convolution types and filter sizes) and identify the operations that improve the accuracy-latency trade-off.
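As a rough illustration of this kind of analysis (the exact shapes and operation set studied in the talk are not given here), the following Python sketch compares the multiply-accumulate (MAC) cost of a standard convolution against a depthwise-separable one across filter sizes. The feature-map shape is hypothetical, and MAC counts are only a proxy: actual latency on Ethos-U65 depends on how the NPU schedules each operator, which is why on-device measurement matters.

```python
# Minimal sketch (not the authors' tooling): analytical MAC counts for two
# convolution types at different filter sizes, stride 1, 'same' padding.

def conv2d_macs(h, w, c_in, c_out, k):
    """MACs for a standard KxK convolution."""
    return h * w * c_in * c_out * k * k

def depthwise_separable_macs(h, w, c_in, c_out, k):
    """MACs for a KxK depthwise conv followed by a 1x1 pointwise conv."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

H, W, C_IN, C_OUT = 56, 56, 64, 128  # hypothetical feature-map shape

for k in (1, 3, 5):
    std = conv2d_macs(H, W, C_IN, C_OUT, k)
    dws = depthwise_separable_macs(H, W, C_IN, C_OUT, k)
    print(f"k={k}: standard={std / 1e6:.1f} MMACs, "
          f"separable={dws / 1e6:.1f} MMACs, ratio={std / dws:.1f}x")
```

At k=3 the separable form is roughly an order of magnitude cheaper in MACs, which is the kind of gap that motivates choosing operations per device rather than reusing one architecture everywhere.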

Based on this investigation, we carefully redesign well-known convolutional blocks (e.g., inverted residual blocks and ghost blocks) and use them to replace computationally inefficient blocks in the given models. We demonstrate that the model variants obtained by our approach significantly reduce both inference time and memory footprint on Ethos-U65 without noticeable accuracy drops.
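For concreteness, below is a minimal Keras sketch of the two block families named above, following their original formulations (MobileNetV2-style inverted residuals and GhostNet-style ghost modules); the redesigned variants presented in the talk are not reproduced here, and all widths, strides, and ratios are illustrative. For Ethos-U65 deployment, such a model would typically be quantized to int8 and compiled with Arm's Vela compiler.

```python
import tensorflow as tf
from tensorflow.keras import layers

def inverted_residual(x, expansion=6, out_channels=None, stride=1):
    """MobileNetV2-style inverted residual: expand -> depthwise -> project."""
    in_channels = x.shape[-1]
    out_channels = out_channels or in_channels
    h = layers.Conv2D(in_channels * expansion, 1, use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same",
                               use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(6.0)(h)
    h = layers.Conv2D(out_channels, 1, use_bias=False)(h)  # linear projection
    h = layers.BatchNormalization()(h)
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])                           # residual shortcut
    return h

def ghost_block(x, out_channels, ratio=2):
    """GhostNet-style ghost module: a thin 1x1 conv produces
    out_channels // ratio primary features, then a cheap depthwise conv
    generates the remaining 'ghost' features, which are concatenated.
    Assumes out_channels is divisible by ratio."""
    primary = layers.Conv2D(out_channels // ratio, 1, use_bias=False)(x)
    primary = layers.BatchNormalization()(primary)
    primary = layers.ReLU()(primary)
    ghost = layers.DepthwiseConv2D(3, padding="same", use_bias=False)(primary)
    ghost = layers.BatchNormalization()(ghost)
    ghost = layers.ReLU()(ghost)
    return layers.Concatenate()([primary, ghost])

# Usage: swap a heavy block in an existing model for one of these variants.
inputs = tf.keras.Input((56, 56, 64))  # hypothetical input shape
y = inverted_residual(inputs, expansion=6)
y = ghost_block(y, out_channels=128)
model = tf.keras.Model(inputs, y)
model.summary()
```

Both blocks trade a single expensive convolution for a sequence of cheap ones (1x1 and depthwise), which is the structural idea behind replacing inefficient blocks; the talk's contribution lies in tailoring these designs to what Ethos-U65 executes efficiently.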