



## arm

Optimizing Power and Performance for Machine Learning at the Edge:

Model Deployment Overview

Arm

Lingchuan Meng, Principal Engineer, Arm

Naveen Suda, Principal Engineer, Arm

October 20, 2020

### AI Virtual Tech Talks Series

| Date              | Title                                                                                         | Host                                      |
|-------------------|-----------------------------------------------------------------------------------------------|-------------------------------------------|
| October 20, 2020  | Optimizing Power and Performance for Machine Learning at the Edge - Model Deployment Overview | Arm                                       |
| November 3, 2020  |                                                                                               |                                           |
| November 17, 2020 | The Smart City in Motion - AI in intelligent transportation                                   | Clever Devices, NXP Semiconductor, Arcturus |

### Presenters



Lingchuan Meng, Principal Engineer, Arm



Naveen Suda, Principal Engineer, Arm

## ML on the Edge - Challenges

- Edge device constraints for deploying ML algorithms
  - Limited memory
    - Flash (32 kB to a few MB)
    - SRAM (16 kB to 512 kB)
  - Limited compute capability (100 MHz to 1 GHz)
- Hardware/software features
  - Compression HW: pruning, clustering, etc.
  - Mixed precision: 8-bit, 16-bit, etc.
  - Algorithmic: Winograd, etc.
  - Layer fusion: conv-add-pool-relu, etc.





ML solutions = Model Optimization  $\rightarrow$  Software  $\rightarrow$  Hardware



## **End-to-end Technology Exploration**







# Model Deployment Optimizations

**Deployment Optimization Flow** 







## Overview of Technologies





## **Optimization Results**

| Model        | Sparsity | Accuracy Δ |
|--------------|----------|------------|
| Inception V3 | 50%      | +0.1%      |
| ResNet 50    | 50%      | +0.7%      |
| VGG 16       | 50%      | +1.6%      |
| MobileNet V1 | 50%      | -0.9%      |
| Wav2Letter   | 50%      | -1.34%     |
| DS-CNN Large | 80%      | -0.5%      |

unstructured pruning



Lingchuan Meng, et al. "Neural Network Optimizations for On-Device AI" Embedded World Conference (2020).



## Overview of Pruning Techniques

#### Magnitude Pruning





Song Han, et al. "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding" arXiv: 1510.00149 (2015).

#### **Channel Pruning**





Yihui He, et al. "Channel Pruning for Accelerating Very Deep Neural Networks" <u>arXiv: 1707.06168</u> (2017).

#### **Structured Pruning**





Sajid Anwar, et al. "Structured Pruning of Deep Convolutional Neural Networks" arXiv: 1512.08571 (2015).

Jeff Pool, "Accelerating Sparsity in the Nvidia Ampere Architecture" GTC 2020.

## **Pruning**

- Inducing sparsity in over-parameterized models
  - Improve model compression and computation efficiency
- Structure
  - Unstructured: irregular locations of zeros
  - Structured: pre-defined patterns of zeros
- Spatial granularity
  - Layer / filter / kernel / weight
  - Compressibility vs. acceleration
- Techniques
  - Magnitude / Variational dropout / Regularization
- Challenges
  - Accuracy degradation
  - High sparsity → better performance?
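As a minimal sketch of the magnitude technique listed above, the NumPy snippet below zeroes the smallest-magnitude weights of a single tensor until a target sparsity is reached (unstructured pruning). The function name `magnitude_prune` and the example values are illustrative assumptions, not part of the talk.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights."""
    k = int(round(sparsity * weights.size))  # number of weights to remove
    if k == 0:
        return weights.copy()
    magnitudes = np.abs(weights).ravel()
    # k-th smallest magnitude; every weight at or below it is pruned.
    threshold = np.partition(magnitudes, k - 1)[k - 1]
    mask = np.abs(weights) > threshold
    return weights * mask

# Illustrative weights (made up for this example).
w = np.array([0.1, -0.5, 0.3, -0.05, 0.8, 0.2])
pruned = magnitude_prune(w, sparsity=0.5)
```

Ties at the threshold are pruned together, so the achieved sparsity can slightly exceed the target when many weights share the same magnitude.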





## Pruning – Key Concepts

#### Pruning schedule

- Pruning damages the model
- Increase sparsity gradually
- Strategies: inverse power/linear/cosine

#### Distribution of sparsity

- Not all layers are equal
- Uniform: same sparsity for all layers
- Reinforcement-learning (RL)

#### Hardware-aware hyper-parameter tuning

- Tuning for a single optimization
- Joint tuning for multiple optimizations






## **Hyper-Parameter Tuning**

#### **Deterministic**

- Uniform
  - Same sparsity for all prunable layers
- Heuristic
  - Per-layer target sparsity:  $\alpha \cdot \log |var_i|$

$$\alpha = \frac{pr \cdot \sum |var_i|}{\sum (|var_i| \cdot \log |var_i|)}$$
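The heuristic above can be checked numerically. The sketch below assumes that $|var_i|$ is the number of parameters in layer $i$ and that $pr$ is the overall target pruning ratio; the function name and the layer sizes are hypothetical. It confirms that the parameter-weighted average of the per-layer targets equals $pr$.

```python
import math

def per_layer_sparsity(layer_sizes, pr):
    """Per-layer target sparsity alpha * log|var_i| for an overall ratio pr."""
    total = sum(layer_sizes)
    weighted_log = sum(n * math.log(n) for n in layer_sizes)
    alpha = pr * total / weighted_log
    return [alpha * math.log(n) for n in layer_sizes]

# Hypothetical layer sizes (parameter counts).
sizes = [1_000, 10_000, 100_000]
targets = per_layer_sparsity(sizes, pr=0.5)

# Parameter-weighted average sparsity across all layers.
overall = sum(s * n for s, n in zip(targets, sizes)) / sum(sizes)
```

Larger layers receive higher sparsity targets. The base of the logarithm does not matter here, because $\alpha$ rescales it. For very uneven layer sizes a target can exceed 1 and would need clipping, which this sketch does not handle.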

Dynamically increase pruning ratio during training

$$\overline{pr} = pr \cdot \left(1 - \left(1 - \frac{t - t_0}{n\Delta t}\right)^3\right)$$
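A minimal sketch of the schedule above, assuming $t$ is the current training step, $t_0$ the step at which pruning starts, and $n\Delta t$ the total number of steps over which sparsity ramps up. Clamping the progress to $[0, 1]$ outside that window is an added assumption.

```python
def scheduled_sparsity(t, pr, t0, n, dt):
    """Current pruning ratio at step t for a final target ratio pr."""
    progress = (t - t0) / (n * dt)
    progress = min(max(progress, 0.0), 1.0)  # hold at 0 before t0, at pr after the ramp
    return pr * (1.0 - (1.0 - progress) ** 3)

# Ramp to 80% sparsity over 10 pruning steps spaced 100 training steps apart.
start = scheduled_sparsity(0, pr=0.8, t0=0, n=10, dt=100)
halfway = scheduled_sparsity(500, pr=0.8, t0=0, n=10, dt=100)
end = scheduled_sparsity(1000, pr=0.8, t0=0, n=10, dt=100)
```

The cubic term prunes aggressively early, when many redundant weights remain, and slows down as the target is approached.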

Michael Zhu, et al. "To prune, or not to prune: exploring the efficacy of pruning for model compression" arXiv: 1710.01878 (2017).

Jiaxiang Wu, et al. "PocketFlow: An Automated Framework for Compressing and Accelerating Deep Neural Networks" https://openreview.net/pdf?id=H1fWoYhdim (2018).

#### **Reinforcement Learning**



Yihui He, et al. "AMC: AutoML for Model Compression and Acceleration on Mobile Devices" arXiv: 1802.03494 (2018).



## Hardware-Aware Hyper-Parameter Tuning

- Optimizing for accuracy + hardware metrics
  - HW metrics: latency, compression, bandwidth ...
  - Multi-objective reward/fitness functions
  - Joint tuning for multiple optimizations
    - Larger search space and better results
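One possible form of a multi-objective reward is the soft-constraint product used in MnasNet, which scales accuracy by a power of the latency relative to a target. The talk does not say which reward it uses, so the function below is only an illustration, and the name `soft_constraint_reward` is hypothetical.

```python
def soft_constraint_reward(accuracy, latency_ms, target_ms, w=-0.07):
    """MnasNet-style reward: accuracy scaled by (latency / target) ** w.

    With w < 0, candidates slower than the target are penalized and
    faster ones are mildly rewarded.
    """
    return accuracy * (latency_ms / target_ms) ** w

on_target = soft_constraint_reward(0.75, latency_ms=80.0, target_ms=80.0)
too_slow = soft_constraint_reward(0.75, latency_ms=160.0, target_ms=80.0)
faster = soft_constraint_reward(0.75, latency_ms=40.0, target_ms=80.0)
```

The same shape extends to other hardware metrics, such as compressed size or bandwidth, by multiplying in further ratio terms.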





## Convolution with Sparse Tensors

- Accelerating sparse matrix multiply
  - Sparse weights + dense activations
  - Dense math primitives → sparse primitives
  - Vector loads of activations
  - Randomly-accessed values cached in L1
  - Prefetching activations to reduce cache misses
  - Block-structured sparsity for additional speedup
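To illustrate the sparse-weights-times-dense-activations pattern above, the sketch below stores a pruned weight matrix in compressed sparse row (CSR) form and multiplies it by a dense activation matrix. It is a plain NumPy reference and not the optimized kernel from the cited paper. Each nonzero weight scales one whole activation row, which is the access pattern that vector loads exploit.

```python
import numpy as np

def dense_to_csr(w):
    """Convert a dense 2-D matrix to CSR arrays (data, indices, indptr)."""
    data, indices, indptr = [], [], [0]
    for row in w:
        cols = np.nonzero(row)[0]
        data.extend(row[cols])
        indices.extend(cols)
        indptr.append(len(data))
    return np.array(data), np.array(indices, dtype=np.int64), np.array(indptr)

def csr_matmul(data, indices, indptr, x):
    """Compute W @ x where W is given in CSR form and x is dense."""
    n_rows = len(indptr) - 1
    out = np.zeros((n_rows, x.shape[1]), dtype=x.dtype)
    for r in range(n_rows):
        for j in range(indptr[r], indptr[r + 1]):
            # One nonzero weight times one full row of activations.
            out[r] += data[j] * x[indices[j]]
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((4, 6))
w[np.abs(w) < 0.8] = 0.0              # induce unstructured sparsity
x = rng.standard_normal((6, 5))       # dense activations
y = csr_matmul(*dense_to_csr(w), x)
```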

|      | Model  | Width | Top-1       | Mega FLOPs | Mega Params | Time SD835 | Time SD670 | Time Wasm |
|------|--------|-------|-------------|------------|-------------|------------|------------|-----------|
| MBv1 | Dense  | 1.0   | 70.9        | 1120       | 4.30        | 125        | 106        | 271       |
| MBv1 | Sparse |       | <b>72.0</b> | <b>268</b> | <b>2.28</b> | <b>58</b>  | <b>64</b>  | <b>97</b> |
| MBv1 | Dense  | .75   | 68.4        | 636        | 2.59        | 73         | 64         | 170       |
| MBv1 | Sparse | 1.0   | 68.4        | <b>146</b> | <b>1.48</b> | <b>31</b>  | <b>34</b>  | <b>56</b> |



Erich Elsen, et al. "Fast Sparse ConvNets" arXiv: 1911.09723 (2019).



## Algorithmic Optimizations

#### Complex-domain Winograd



|                     | VGG 16 | ResNet 18 | GoogleNet | SqueezeNet |
|---------------------|--------|-----------|-----------|------------|
| Speedup vs NCNN 2x2 | 94.55% | 21.13%    | 12.82%    | 8.86%      |

- Lingchuan Meng, et al. "Efficient Winograd Convolution via Integer Arithmetic" arXiv: 1901.01965
- NCNN: https://github.com/Tencent/ncnn
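For background, the sketch below shows the standard real-valued Winograd F(2,3) transform, not the complex-domain or 8-bit variants from the cited paper. It computes two outputs of a 3-tap 1-D convolution with 4 multiplications instead of 6, using the usual transform matrices $B^T$, $G$, and $A^T$.

```python
import numpy as np

# Standard F(2,3) transform matrices.
BT = np.array([[1, 0, -1, 0],
               [0, 1, 1, 0],
               [0, -1, 1, 0],
               [0, 1, 0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
AT = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """Two outputs of a 3-tap filter g over a 4-element input tile d."""
    # The elementwise product holds the only 4 data-dependent multiplies.
    return AT @ ((G @ g) * (BT @ d))

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.0, -1.0])
y = winograd_f23(d, g)
reference = np.correlate(d, g, mode="valid")  # direct sliding-window result
```

In practice the filter transform `G @ g` is computed once offline, so only the input and output transforms remain at inference time.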

#### 8-bit Winograd





## Quantization

- Storing and computing with tensors at lower bitwidths
  - Typically FP32 -> INT8:  $x_{FP32} = scale \cdot (x_{INT8} - zero\_point)$
  - 4x savings in model size and memory bandwidth
  - Inference speedup: 2-4x
  - More aggressive quantization in active research
- Granularity: per-layer vs. per-channel
- Symmetric vs. asymmetric
  - Weights: symmetric with zero\_point=0
  - Activations: asymmetric
- Finding optimal quantization ranges
  - Balancing range vs. resolution
  - Techniques: minimizing quantization error, KL-divergence, etc.
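The affine mapping above can be sketched as follows. The quantization range is taken from the min and max of the data, which is the simplest of the range-finding options and an assumption of this example. The helper names are hypothetical, and a non-degenerate range is assumed.

```python
import numpy as np

def choose_qparams(x_min, x_max, qmin=-128, qmax=127):
    """Asymmetric INT8 parameters covering [x_min, x_max]."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # range must contain 0
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def quantize(x, scale, zero_point, qmin=-128, qmax=127):
    q = np.round(x / scale) + zero_point
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, scale, zero_point):
    # x_FP32 = scale * (x_INT8 - zero_point)
    return scale * (q.astype(np.float64) - zero_point)

x = np.linspace(-1.0, 3.0, 101)
scale, zero_point = choose_qparams(x.min(), x.max())
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
max_error = np.max(np.abs(x - x_hat))
```

For values inside the range, the round-trip error is bounded by half a quantization step (`scale / 2`). A wider range lowers the clipping error but raises this rounding error, which is the range-versus-resolution balance noted above.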







## **Quantization Workflow**





## **Quantizing Activation Nodes**



- Simulate quantization in forward pass.
- Straight-through-estimator (STE) in backward pass during QAT.
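The two bullets above can be sketched with hand-written forward and backward functions. The forward pass quantizes and immediately dequantizes, and the backward pass treats rounding as the identity. Zeroing the gradient for inputs that were clipped is a common STE variant and an assumption here, since the slides do not specify it.

```python
import numpy as np

def fake_quant_forward(x, scale, zero_point, qmin=-128, qmax=127):
    """Simulate INT8 quantization while staying in floating point."""
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax)
    return scale * (q - zero_point)

def fake_quant_backward(x, upstream_grad, scale, zero_point, qmin=-128, qmax=127):
    """Straight-through estimator: pass gradients where x was not clipped."""
    lo = scale * (qmin - zero_point)
    hi = scale * (qmax - zero_point)
    inside = (x >= lo) & (x <= hi)
    return upstream_grad * inside

x = np.array([-10.0, 0.1, 10.0])
y = fake_quant_forward(x, scale=0.05, zero_point=0)
grad = fake_quant_backward(x, np.ones_like(x), scale=0.05, zero_point=0)
```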



## **Mixed-Precision Quantization**

- Some layers are less sensitive to aggressive quantization.
- How to find the optimal bit-width per layer?
- A solution: Sensitivity-based mixed precision quantization
  - Find lowest bit-width without significant accuracy drop.
  - Consider the cascaded effect of quantization error from layer-to-layer.
  - Start from the largest layer, so it is compressed the most.
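A greedy search in the spirit of the bullets above might look like the sketch below. It visits layers from largest to smallest and keeps earlier choices in place, so the cascaded error is included in every evaluation. The `evaluate` callable, the candidate bit-widths, the accuracy-drop threshold, and the toy sensitivity numbers are all hypothetical stand-ins, not the method or results from the talk.

```python
def assign_bitwidths(layer_sizes, evaluate, candidate_bits=(2, 4, 6, 8), max_drop=0.01):
    """Greedy per-layer bit-width selection.

    layer_sizes: dict of layer name -> parameter count.
    evaluate:    callable mapping {layer: bits} to accuracy.
    """
    bits = {name: max(candidate_bits) for name in layer_sizes}
    baseline = evaluate(bits)
    # Largest layer first, so it gets the most aggressive compression.
    for name in sorted(layer_sizes, key=layer_sizes.get, reverse=True):
        for b in sorted(candidate_bits):  # try the lowest bit-width first
            trial = {**bits, name: b}
            if baseline - evaluate(trial) <= max_drop:
                bits = trial
                break
    return bits

# Toy accuracy model: each removed bit costs a made-up amount per layer.
SENSITIVITY = {"conv_large": 0.001, "conv_small": 0.01}

def toy_evaluate(bits):
    return 0.90 - sum(SENSITIVITY[n] * (8 - b) for n, b in bits.items())

bits = assign_bitwidths({"conv_large": 1_000_000, "conv_small": 10_000}, toy_evaluate)
```

In this toy setup the large, insensitive layer drops to 2 bits while the small, sensitive layer stays at 8 bits.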



#### **MobileNet V2**

- Average bitwidth: 4.5 bits
- 2% accuracy drop (without retraining)
- Fine-tuning recovered 1.5% accuracy



## Clustering: Non-Uniform Quantization

- Non-uniform quantization yields smaller quantization error than uniform quantization.
- Better weight compression especially for large layers.
- K-means clustering algorithm to find initial cluster centroids.
- Fine-tune the cluster centroids with clustering constraints in the graph.
- Preserve sparsity during fine-tuning using sparsity masks.
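The initialization step above can be sketched with a small 1-D k-means over the nonzero weights of a layer, leaving pruned weights at zero through a sparsity mask. Linear centroid initialization and the fixed iteration count are assumptions of this example. The subsequent centroid fine-tuning in the training graph is not shown.

```python
import numpy as np

def cluster_weights(weights, n_clusters, n_iters=20):
    """Replace each nonzero weight by its nearest k-means centroid."""
    mask = weights != 0                      # sparsity mask: zeros stay zero
    values = weights[mask]
    # Linear initialization between the smallest and largest nonzero weight.
    centroids = np.linspace(values.min(), values.max(), n_clusters)
    for _ in range(n_iters):
        assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            members = values[assign == c]
            if members.size:                 # leave empty clusters unchanged
                centroids[c] = members.mean()
    assign = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    clustered = np.zeros_like(weights)
    clustered[mask] = centroids[assign]
    return clustered, centroids

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 8))
w[np.abs(w) < 0.5] = 0.0                     # pruned layer
clustered, centroids = cluster_weights(w, n_clusters=4)
```

With 4 centroids, each surviving weight can be stored as a 2-bit index into a small codebook.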







## **Collaborative Optimizations**









# Model Architecture Optimizations

## **Efficient Network Building Blocks**

- Arithmetic reduction by operator decomposition/approximation
  - Depthwise convolution (MobileNet V1)
  - Inverted bottleneck with residual (MobileNet V2)
- Operator sparsification
  - sparsity → performance?
  - Replace dense ops with sparse ops
- Asymptotically-faster operators
  - Winograd convolution
  - FFT







## Efficiency is Target-Specific

- Hardware utilization matters!
  - Inverted bottleneck conv block
  - Fused inverted bottleneck conv block
    - 1x1 conv + 3x3 depthwise -> 3x3 conv
    - More trainable parameters
    - Better HW utilization (hence a good latency-accuracy trade-off)
  - 3x3 vs. 5x5 convolution
    - 5x5 convolution leads to a 2.78x increase in MACs and parameters
    - Only a 35% runtime increase
    - Good trade-off for more trainable parameters at a marginal latency cost
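The 2.78x figure follows directly from the kernel area, $5^2 / 3^2 = 25/9$. The sketch below works it out for a standard convolution, assuming stride 1 and "same" padding. The layer shape is a made-up example and cancels out of the ratio.

```python
def conv_macs(height, width, c_in, c_out, kernel):
    """Multiply-accumulates for a standard conv (stride 1, same padding)."""
    return height * width * c_in * c_out * kernel * kernel

# Hypothetical layer shape; only the kernel size differs between the two.
macs_3x3 = conv_macs(56, 56, 64, 64, kernel=3)
macs_5x5 = conv_macs(56, 56, 64, 64, kernel=5)

mac_ratio = macs_5x5 / macs_3x3        # 25 / 9
# Relative time per MAC implied by the reported 35% runtime increase.
time_per_mac = 1.35 / mac_ratio
```

A runtime increase of 35% for 2.78x the MACs means each MAC in the 5x5 layer costs roughly half as much time, which is the utilization gain the slide points to.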



Gupta, Suyog, and Berkin Akin. "Accelerator-aware Neural Network Design using AutoML." *arXiv preprint arXiv:2003.02838* (2020).

## Efficiency is More than Conv/FC

- Layer normalization and Gaussian Error Linear Unit (GELU) impact latency
- Layer normalization replaced by elementwise linear transform
  - $NoNorm(h) = \gamma \circ h + \beta$
- GELU replaced by ReLU
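The difference between the two normalizations can be seen in a short NumPy sketch. Layer normalization needs a mean and variance reduction over the hidden dimension for every token, whereas NoNorm is a purely elementwise scale and shift. The function names and shapes are illustrative.

```python
import numpy as np

def layer_norm(h, gamma, beta, eps=1e-6):
    """Standard layer normalization over the last (hidden) dimension."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return gamma * (h - mean) / np.sqrt(var + eps) + beta

def no_norm(h, gamma, beta):
    """NoNorm(h) = gamma * h + beta, with no reductions at inference time."""
    return gamma * h + beta

def relu(x):
    """ReLU, the cheaper replacement for GELU."""
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
h = rng.standard_normal((2, 8))        # (tokens, hidden)
gamma, beta = np.full(8, 2.0), np.full(8, 0.5)
out = relu(no_norm(h, gamma, beta))
```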



|                            | #Params | #FLOPS | Latency | CoLA (8.5k) | SST-2 (67k) | MRPC (3.7k) | STS-B (5.7k) | QQP (364k) | MNLI-m/mm (393k)  | QNLI (108k) | RTE (2.5k) | GLUE |
|----------------------------|---------|--------|---------|-------------|-------------|-------------|--------------|------------|-------------------|-------------|------------|------|
| MobileBERT <sub>TINY</sub> | 15.1M   | 3.1B   | 40 ms   | 46.7        | 91.7        | 87.9        | 80.1         | 68.9       | 81.5/81.6         | 89.5        | 65.1       | 75.8 |
| MobileBERT                 | 25.3M   | 5.7B   | 62 ms   | 50.5        | 92.8        | 88.8        | 84.4         | 70.2       | 83.3/82.6         | 90.6        | 66.2       | 77.7 |
| MobileBERT w/o OPT         | 25.3M   | 5.7B   | 192 ms  | 51.1        | 92.6        | 88.8        | 84.8         | 70.5       | 84.3/<b>83.4</b>  | 91.6        | 70.4       | 78.5 |

Lower is better: latency. Higher is better: task scores.

© 2020 Arm Limited

## Hardware-aware Neural Architecture Search (NAS)

- Diversity in ML hardware: CPUs, GPUs, and NPUs
  - Different programming models, compute capabilities, memory organization
  - The same neural network architecture cannot map efficiently onto every HW target
- Awareness of the target HW architecture by NN architectures
  - Hand tuning model architectures
  - Automated neural architecture search (NAS)
- Search objectives
  - Latency proxies: # MACs, # parameters (NASNet)
  - Hardware performance model (MNASNet, ProxylessNAS, FBNet)
- State-of-the-art models searched by NAS
  - MobileNet v3, FBNet, EfficientNet, MobileBERT



## NAS – Key Concepts



#### Search space

- Chain-structured
- Architecture template
- Cell-based

#### Search strategy

- Reinforcement learning
- Gradient-based methods
- Evolutionary algorithms
- Random search with rejection sampling

#### Performance estimation

- Accuracy estimation
- Latency estimation
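The three concepts above fit together as in the toy sketch below: a chain-structured search space, random search with rejection sampling as the strategy, and a MAC count as a latency proxy. The search space, the budget, and the `score_fn` callable (standing in for accuracy estimation) are hypothetical and chosen only to keep the example small.

```python
import random

KERNELS = (3, 5)          # candidate kernel sizes per layer
WIDTHS = (16, 32, 64)     # candidate output channels per layer

def count_macs(arch, resolution=32, in_channels=3):
    """Latency proxy: MACs of a chain of stride-1 convolutions."""
    total, c_in = 0, in_channels
    for kernel, c_out in arch:
        total += resolution * resolution * c_in * c_out * kernel * kernel
        c_in = c_out
    return total

def random_search(score_fn, mac_budget, n_layers=4, n_samples=200, seed=0):
    """Sample chain architectures, reject those over budget, keep the best."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_samples):
        arch = tuple((rng.choice(KERNELS), rng.choice(WIDTHS)) for _ in range(n_layers))
        if count_macs(arch) > mac_budget:
            continue  # rejection sampling: skip candidates that violate the budget
        score = score_fn(arch)
        if score > best_score:
            best, best_score = arch, score
    return best, best_score

# Stand-in for accuracy estimation: prefer wider networks.
best_arch, best_score = random_search(
    score_fn=lambda arch: sum(width for _, width in arch),
    mac_budget=50_000_000,
)
```

Replacing `count_macs` with a measured or modeled latency on the target device would make the search hardware-aware rather than proxy-driven.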



## NAS Search Space



Chain-structured

Architecture template

Cell-based


## **Summary**

- Resource constraints and diverse SW + HW features require co-development
- Model optimizations bridge the gap between models + SW + HW
- Efficient building blocks and architectures for higher accuracy and performance
- NAS efficiently navigates in design spaces to automate model design
- New opportunities in joint optimization of NAS, pruning, and quantization





## Questions?



## Thank you!

Tweet us: <u>@ArmSoftwareDev</u>

Check out our Arm YouTube <u>channel</u> and our Arm Software Developers YouTube <u>channel</u>

Sign up now for our next AI Virtual Tech Talk here

Don't forget to fill out our survey to be in with a chance of winning an Arduino Nano 33 BLE board


