# arm Research

# TinyML Model Design

2nd On-Device Intelligence Workshop @ MLSys 2021



Igor Fedorov Arm ML Research Lab April 9, 2021

### Tiny Hardware

- ~50 billion MCU chips shipped in '19
  - ~100 million GPUs in '18
- Severe memory limitations
  - Limited flash memory —> limited model size
  - Limited SRAM —> limited feature map size
- LeNet for MNIST
  - 420 KB flash
  - 12 KB SRAM

Table 1: Processors for ML inference: estimated characteristics to indicate the relative capabilities.

| Processor                   | Usecase  | Compute        | Memory      | Power | Cost   |
|-----------------------------|----------|----------------|-------------|-------|--------|
| Nvidia 1080Ti GPU           | Desktop  | 10 TFLOPs/Sec  | 11 GB       | 250 W | \$700  |
| Intel i9-9900K CPU          | Desktop  | 500 GFLOPs/Sec | 256 GB      | 95 W  | \$499  |
| Google Pixel 1 (Arm CPU)    | Mobile   | 50 GOPs/Sec    | 4 GB        | ~5 W  | -      |
| Raspberry Pi (Arm CPU)      | Hobbyist | 50 GOPs/Sec    | 1 GB        | 1.5 W | -      |
| Micro Bit (Arm MCU)         | IoT      | 16 MOPs/Sec    | 16 KB       | ~1 mW | \$1.75 |
| Arduino Uno (Microchip MCU) | IoT      | 4 MOPs/Sec     | 2 KB        | ~1 mW | \$1.14 |



- "Bonsai is not compared to deep convolutional neural networks as they have not yet been demonstrated to fit on such tiny IoT devices" [2]
- "Consider a typical IoT device that has ≤ 32kB RAM and a 16MHz processor. Most existing ML models cannot be deployed on such tiny devices" [3]



<sup>[1]</sup> Fedorov et al., SpArSe: Sparse Architecture Search for CNNs on Resource-Constrained Microcontrollers. NeurIPS '19

<sup>[2]</sup> Kumar et al., Resource-efficient Machine Learning in 2 KB RAM for the Internet of Things. ICML '17

<sup>[3]</sup> Gupta et al., ProtoNN: Compressed and Accurate kNN for Resource-scarce Devices. ICML '17

# Can we design the model *for* the device?

Deep learning training software

- Tensorflow
- Pytorch
- etc.

Deployment tool

- TFlite-micro [4]
- TinyEngine [5]



Research papers

- Is the code publicly available?

Questions to ask:

- Are all of the operators supported?
- Are all of the compute graphs / structures supported?
- What features are supported?
- What compute resources are required?

- [4] David et al. TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems. MLSys '21
- [5] Lin et al. MCUNet: Tiny Deep Learning on IoT Devices. NeurIPS '20

## Algorithmic Tools

- Quantization
  - Int8 cheaper than float (storage + compute)
  - Sub int8 not supported by HW
- Pruning
  - Structured pruning
    - HW friendly
    - Reduces ops
    - Limited compression benefits
  - Unstructured / random pruning
    - Not HW friendly unless extreme
    - Large compression benefits

- Neural architecture search
  - Operators, connectivity, layer width, resolution, etc.
  - Computationally demanding
  - Not all computational graphs supported by deployment tools
- Learn from the HW directly
  - HW model, or
  - HW interface
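The remark that unstructured pruning is "not HW friendly unless extreme" can be illustrated with a back-of-the-envelope storage estimate. The sketch below assumes int8 weights and a hypothetical sparse format that stores one 1-byte index next to every surviving weight. Real formats and kernels differ, so the break-even point is only indicative.

```python
def dense_bytes(num_weights: int, weight_bytes: int = 1) -> int:
    """Storage for a dense int8 weight tensor."""
    return num_weights * weight_bytes


def sparse_bytes(num_weights: int, sparsity: float,
                 weight_bytes: int = 1, index_bytes: int = 1) -> int:
    """Storage for a hypothetical sparse format: value + index per non-zero."""
    nnz = round(num_weights * (1.0 - sparsity))
    return nnz * (weight_bytes + index_bytes)


n = 10_000  # illustrative layer size, not taken from the talk
for sparsity in (0.3, 0.5, 0.9):
    print(f"sparsity={sparsity:.1f} dense={dense_bytes(n)} B "
          f"sparse={sparse_bytes(n, sparsity)} B")
```

Under these assumptions the sparse encoding only beats dense storage above 50% sparsity, and that is before accounting for the irregular memory access of a sparse kernel.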



#### Quantization

- Float —> int8 (weights + activations)
  - 4x reduction in model size
  - 4x reduction in feature map size
  - Cheaper computation
- Sub 8-bit and non-uniform not supported by HW
- $Q(w) = s \times \mathrm{round}\left(\frac{\mathrm{clip}(w, -w_{max}, w_{max})}{s}\right)$, with $s = \frac{w_{max}}{2^{8-1}-1}$ [6]
- How to select $w_{max}$? Depends...

- If training data is available
  - Post-training calibration [7]
  - Quantization aware training [8]
    - Treat $w_{max}$ as a variable and optimize by GD on the task objective
    - Two dependencies on $w_{max}$: the clipping range and the scale $s$
    - Straight-through estimator for the gradient through the rounding step
- If no training data, still have options [9]
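The quantizer $Q(w)$ and the two dependencies on $w_{max}$ can be sketched in NumPy as follows. This is a minimal illustration, not the method of any cited paper: the function names are hypothetical, and the gradient uses the common straight-through assumption that the derivative of `round` is 1.

```python
import numpy as np


def quantize(w, w_max, bits=8):
    """Fake-quantize: Q(w) = s * round(clip(w, -w_max, w_max) / s)."""
    s = w_max / (2 ** (bits - 1) - 1)
    return s * np.round(np.clip(w, -w_max, w_max) / s)


def to_int8(w, w_max):
    """Return the int8 codes and the scale s, so that Q(w) = s * codes."""
    s = w_max / 127
    codes = np.round(np.clip(w, -w_max, w_max) / s).astype(np.int8)
    return codes, s


def grad_w_max_ste(w, w_max, bits=8):
    """dQ/dw_max with the straight-through estimator (d round(x)/dx := 1).

    Q = s * round(c / s) with c = clip(w, -w_max, w_max), so
    dQ/dw_max = (round(c/s) - c/s) * ds/dw_max + dc/dw_max.
    """
    levels = 2 ** (bits - 1) - 1
    s = w_max / levels
    c = np.clip(w, -w_max, w_max)
    ds = 1.0 / levels                                   # dependency via the scale
    dc = np.where(np.abs(w) > w_max, np.sign(w), 0.0)   # dependency via the clip range
    return (np.round(c / s) - c / s) * ds + dc


w = np.array([-2.0, -0.3, 0.0, 0.26, 1.5])  # illustrative weights
codes, s = to_int8(w, w_max=1.0)
w_q = quantize(w, w_max=1.0)
grad = grad_w_max_ste(w, w_max=1.0)
# float32 -> int8 storage ratio
ratio = w.astype(np.float32).nbytes / codes.nbytes
```

Clipped weights receive a gradient of roughly $\pm 1$ with respect to $w_{max}$, while in-range weights only see the small rounding-residual term, which is why the choice of $w_{max}$ trades clipping error against resolution.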



<sup>[7]</sup> https://www.tensorflow.org/lite/performance/post_training_quantization

<sup>[9]</sup> Nagel et al., Data-Free Quantization Through Weight Equalization and Bias Correction. arXiv '19

# Pruning

- Unstructured [10] vs structured [11]
  - Can the HW benefit from highly compressed weights?
- Sparsity promoting regularization —> drive the weights to 0
  - How to pick number of non-zeros per layer?
  - Reinforcement learning [12]
- Gradient based [13]
  - $w \leftarrow w \times \mathbb{1}_{w > \tau}$; approximate the indicator by a sigmoid during backprop
- Rank —> pruned —> retrain —> (repeat)
  - Magnitude [14], minimal influence on objective [15]
- [10] Molchanov et al. Variational Dropout Sparsifies Deep Neural Networks. ICML '17
- [11] Louizos et al. Bayesian Compression for Deep Learning. NeurIPS '17
- [12] He et al. AMC: AutoML for Model Compression and Acceleration on Mobile Devices. ECCV '18
- [13] Fedorov et al. TinyLSTMs: Efficient Neural Speech Enhancement for Hearing Aids. Interspeech '20
- [14] Han et al. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. ICLR '16
- [15] Molchanov et al. Importance Estimation for Neural Network Pruning. CVPR '19
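Two of the pruning recipes on this slide can be sketched in a few lines of NumPy. The first function is one step of magnitude-based ranking and pruning. The second pair follows the slide's thresholding form literally, with a hard indicator in the forward pass and a sigmoid surrogate for the gradient with respect to $\tau$. Function names and the `temperature` parameter are hypothetical, and this is not claimed to match the exact formulation of [13] or [14].

```python
import numpy as np


def magnitude_prune(w, sparsity):
    """Zero roughly the smallest-magnitude `sparsity` fraction of weights.

    Ties at the threshold are pruned too, so slightly more may be removed.
    """
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy(), np.ones(w.shape, dtype=bool)
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask, mask


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def threshold_gate(w, tau):
    """Forward pass: w * 1[w > tau], using the hard indicator."""
    return w * (w > tau)


def threshold_gate_grad_tau(w, tau, temperature=0.1):
    """Surrogate gradient of the gate output with respect to tau.

    The indicator is replaced by sigmoid((w - tau) / temperature), giving
    d/dtau [w * sigmoid(.)] = -w * sigmoid(.) * (1 - sigmoid(.)) / temperature.
    """
    sg = sigmoid((w - tau) / temperature)
    return -w * sg * (1.0 - sg) / temperature


w = np.array([0.05, -0.8, 0.3, -0.01, 1.2, 0.4])  # illustrative weights
pruned, mask = magnitude_prune(w, sparsity=0.5)
gated = threshold_gate(w, tau=0.35)
g_tau = threshold_gate_grad_tau(w, tau=0.35)
```

In a rank, prune, retrain loop, `magnitude_prune` would be called between training rounds with the mask held fixed during retraining. With the learned-threshold variant, $\tau$ is updated by the same optimizer as the weights.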



#### Neural architecture search



- The search strategy
  - Black box (RL, Bayesian optimization, genetic alg., etc.) [16] vs gradient based [17]
  - Hardware model —> do we need it, or is there a viable proxy?
- Estimation strategy
  - Computationally expensive —> thousands of GPU days in some cases
  - Weight sharing
  - Morphisms
- The distinction between quantization / pruning and NAS is arbitrary [18]
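The gradient-based search strategy can be illustrated with a continuous relaxation in the spirit of DARTS [17]: each layer computes a softmax-weighted mixture of candidate operators, the architecture parameters are trained by gradient descent, and the strongest candidate is kept at the end. The candidate operators below are hypothetical placeholders chosen only to keep the sketch self-contained.

```python
import numpy as np


def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()


# Hypothetical candidate operators for a single layer.
def identity(x):
    return x


def scale_half(x):
    return 0.5 * x


def relu(x):
    return np.maximum(x, 0.0)


CANDIDATES = [identity, scale_half, relu]


def mixed_op(x, alpha):
    """Relaxed layer: softmax(alpha)-weighted sum of all candidate outputs."""
    p = softmax(alpha)
    return sum(p_i * op(x) for p_i, op in zip(p, CANDIDATES))


def discretize(alpha):
    """After the search, keep only the candidate with the largest alpha."""
    return CANDIDATES[int(np.argmax(alpha))]


x = np.array([-2.0, 4.0])
y = mixed_op(x, alpha=np.zeros(3))          # uniform mixture at initialization
chosen = discretize(np.array([0.1, 2.0, -1.0]))
```

Because every candidate is evaluated on every forward pass, the relaxed network costs more memory and compute during search than any single architecture it contains, which is one reason weight sharing matters for the estimation strategy.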



[17] Liu et al. DARTS: Differentiable Architecture Search. ICLR '19

[18] Cai and Vasconcelos. Rethinking Differentiable Search for Mixed-Precision Neural Networks. CVPR '20



# A few examples and lessons from our work

#### SpArSe [1]

- Multi-objective Bayesian optimization
- Architecture, channel / weight pruning thresholds, training hyperparameters
- Closed form memory model
  - SRAM usage modeled by sum of input and output tensors for each layer
- Able to deploy to devices previously thought too small for NNs
- The deployment tool (uTensor) only supported feed-forward graphs
- Bayesian optimization is slow, even with "tricks" to speed it up —> 10 GPU days on CIFAR10-binary
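The closed-form memory model can be sketched as follows: each layer needs its input and output tensors resident in SRAM at the same time. Taking the maximum over layers as the network's requirement is an assumption made here for a feed-forward graph, and the layer shapes are hypothetical, not taken from the SpArSe results.

```python
from math import prod


def layer_sram_bytes(in_shape, out_shape, bytes_per_element=1):
    """SRAM needed while one layer runs: input tensor + output tensor."""
    return (prod(in_shape) + prod(out_shape)) * bytes_per_element


def peak_sram_bytes(layers, bytes_per_element=1):
    """Assumed network requirement: the worst layer in a feed-forward chain."""
    return max(layer_sram_bytes(i, o, bytes_per_element) for i, o in layers)


# Hypothetical small CNN on a 28x28x1 input, int8 activations.
layers = [
    ((28, 28, 1), (24, 24, 4)),   # 5x5 convolution, 4 output channels
    ((24, 24, 4), (12, 12, 4)),   # 2x2 max pooling
    ((12, 12, 4), (10,)),         # fully connected classifier
]

peak = peak_sram_bytes(layers)
fits_2kb = peak <= 2 * 1024       # Arduino Uno class device
fits_16kb = peak <= 16 * 1024     # Micro Bit class device
```

A model of this form is cheap enough to evaluate inside the search loop, which lets the optimizer reject architectures that cannot fit before any training is spent on them.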

|                | MNIST Acc | MNIST $\lVert\cdot\rVert_0$ | MNIST GPUD | CIFAR10-binary Acc | CIFAR10-binary $\lVert\cdot\rVert_0$ | CIFAR10-binary GPUD | CUReT Acc | CUReT $\lVert\cdot\rVert_0$ | CUReT GPUD | Chars4k Acc | Chars4k $\lVert\cdot\rVert_0$ | Chars4k GPUD |
|----------------|-----------|------|------|-----------|------|------|-----------|------|------|-----------|------|------|
| Bonsai         | **97.24**<br>97.01 | **510**<br>2.15e4 | 11 | **73.08**<br>73.02 | **487**<br>512 | 1 | **96.45**<br>95.23 | 8.5e3<br>2.9e4 | 1 | **67.82**<br>58.59 | 1.7e3<br>2.6e4 | 1 |
| Bonsai (16 kB) | - | - | - | **76.66**<br>76.64 | **1.4e3**<br>4.1e3 | 9 | - | - | - | - | - | - |
| ProtoNN        | **96.84**<br>95.88 | **476**<br>1.6e4 | 11 | **76.56**<br>76.35 | 1.4e3<br>4.1e3 | 10 | **96.45**<br>94.44 | 8.5e3<br>1.6e4 | 1 | - | - | - |
| GBDT           | **98.78**<br>97.90 | 7.5e6 | 11 | 77.90<br>77.19 | 1.6e3<br>4e5 | 8 | **96.45**<br>90.81 | 8.5e3<br>6.1e5 | 1 | 67.82<br>43.34 | 1.7e3<br>2.5e6 | 1 |
| kNN            | **96.84**<br>94.34 | 4.71e7 | 11 | **76.34**<br>73.70 | 1.4e3<br>2e7 | 10 | **96.45**<br>89.81 | 8.5e3<br>2.6e6 | 2 | 67.82<br>39.32 | 1.7e3<br>1.7e6 | 1 |
| RBF-SVM        | **97.42**<br>97.30 | 569<br>1e7 | 10 | **81.77**<br>81.68 | 3.2e3<br>1.6e7 | 3 | **97.58**<br>97.43 | 2.2e4<br>2.3e6 | 2 | **67.82**<br>48.04 | 1.7e3<br>2e6 | 1 |
| LeNet + SpVD   | **99.16**<br>99.10 | 1e3<br>1.8e3 | 8 | **75.35**<br>75.09 | 1.4e3<br>1.6e5 | 10 | - | - | - | - | - | - |
| MODC           | **99.17**<br>99.15 | 1.45e3<br>3e3 | 1 | - | - | - | - | - | - | - | - | - |




# A few examples and lessons from our work

#### TinyLSTMs [13]

- Speech denoising using LSTMs
- Latency constraint w/ ops proxy
- Neuron pruning to reduce ops
  - Gradient based threshold learning for efficiency
- Quantization to run w/ integer math
- LSTMs were not supported by TFlite-micro

|                        | SISDR (dB) | BSS SDR (dB) | Params (M) | MS (MB) | WM (KB) | MOps/inf. | Latency (ms/inf.) | Energy (mJ/inf.)  | GPUH  |
|------------------------|------------|--------------|------------|---------|---------|-----------|-------------------|-------------------|-------|
| Baseline (FP32)        | 11.99      | 12.77        | 0.97       | 3.70    | 26.0    | 1.94      | 12.52*            | 6.76*             | 14    |
| Pruned (FP32)          | 11.99      | 12.78        | 0.52       | 1.97    | 18.8    | 1.04      | 6.71*             | 3.62*             | 72    |
| Pruned (INT8) 1        | 11.80      | 12.69        | 0.61       | 0.58    | 5.1     | 1.22      | 7.87              | 4.25              | 61    |
| Pruned (INT8) 2        | 11.47      | 12.22        | 0.33       | 0.31    | 3.7     | 0.66      | 4.26              | 2.30              | 144   |
| Pruned Skip RNN (INT8) | 11.42      | 12.07        | 0.46       | 0.43    | 4.67    | 0.37      | 2.39 <sup>†</sup> | 1.29 <sup>†</sup> | 275   |
| Erdogan et al. [20]    |            | 13.36        | 0.96       | 3.65    | 25.9    | 1.92      | 12.39             | 6.69              | -     |
| Wilson et al. [6]      | 0.70       | 14.60        | 65         | 247.96  | 4472.6  | 130       | 839*              | 453               | 18360 |
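The first four rows of the table are consistent with a simple ops proxy: each parameter contributes one multiply-accumulate, counted as two ops, per inference, and model size is the parameter count times the storage width. The sketch below checks this against the table. It assumes the MS column is reported in units of $2^{20}$ bytes, and it does not apply to the Skip RNN row, where state updates are skipped and ops fall below twice the parameter count.

```python
def ops_per_inference(num_params):
    """Ops proxy: one multiply-accumulate (2 ops) per parameter."""
    return 2 * num_params


def model_size_mib(num_params, bytes_per_param):
    """Model size in units of 2**20 bytes (assumed unit of the MS column)."""
    return num_params * bytes_per_param / 2**20


# (parameters, bytes per parameter) taken from the Params column above.
rows = {
    "Baseline (FP32)": (0.97e6, 4),
    "Pruned (FP32)": (0.52e6, 4),
    "Pruned (INT8) 1": (0.61e6, 1),
    "Pruned (INT8) 2": (0.33e6, 1),
}

for name, (params, width) in rows.items():
    mops = ops_per_inference(params) / 1e6
    size = model_size_mib(params, width)
    print(f"{name}: {mops:.2f} MOps/inf., {size:.2f} MS")
```

The small mismatch on the pruned FP32 row comes from the parameter count being rounded to two decimals in the table.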






# A few examples and lessons from our work

#### MicroNets [19]

- Differentiable NAS
- Commodity MCU target
- 3 TinyMLPerf datasets
- Optimize for Flash, SRAM, ops —> good proxy for latency on MCUs
- Search cost on the order of hours
- TFlite-micro deployment tool
- Memory overheads difficult to predict
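One way to fold flash, SRAM, and ops budgets into a differentiable search is to penalize the expected resource use under the current architecture distribution. The sketch below is a generic illustration for a single layer choice, not the MicroNets formulation: the candidate costs, the budgets, and the hinge-style penalty are all assumptions.

```python
import numpy as np


def softmax(a):
    e = np.exp(a - np.max(a))
    return e / e.sum()


# Hypothetical per-candidate costs: (flash bytes, SRAM bytes, ops).
COSTS = np.array([
    [20_000.0,  8_000.0, 1.0e6],
    [40_000.0, 12_000.0, 2.5e6],
    [80_000.0, 30_000.0, 6.0e6],
])
# Hypothetical device budgets in the same units.
BUDGET = np.array([64_000.0, 16_000.0, 3.0e6])


def expected_cost(alpha):
    """Expected (flash, SRAM, ops) under the softmax over candidates."""
    return softmax(alpha) @ COSTS


def budget_penalty(alpha):
    """Sum of relative budget overshoots; zero when every budget is met."""
    over = expected_cost(alpha) / BUDGET - 1.0
    return float(np.sum(np.maximum(over, 0.0)))


small = budget_penalty(np.array([10.0, 0.0, -10.0]))    # favors the cheapest op
uniform = budget_penalty(np.zeros(3))                   # undecided
large = budget_penalty(np.array([-10.0, 0.0, 10.0]))    # favors the costliest op
```

Adding such a term to the task loss pushes the architecture parameters toward candidates that fit the device. Overheads introduced by the deployment tool are not captured by a model like this, which is the difficulty the last bullet points to.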



[19] Banbury et al. MicroNets: Neural Network Architectures for Deploying TinyML Applications on Commodity Microcontrollers. MLSys '21

