### dMazeRunner: Accelerating loop nests on dataflow accelerators

### **Aviral Shrivastava**

10/14/19

Joint work with:

Shail Dave (ASU), Sasikanth Avancha (Intel PCL), Youngbin Kim and Kyoungwoo Lee (Yonsei)





## **Must-Accelerate Applications in ML Era**

#### Widely Used ML Models

#### Multi Layer Perceptrons





http://yann.lecun.com/exdb/lenet/

#### **Sequence Models**



http://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/

https://deeplearning.mit.edu/

#### **Reinforcement Learning**



AlphaGo. https://www.nature.com/articles/nature24270





http://vision03.csail.mit.edu/cnn_art/index.html

https://pjreddie.com/darknet/

#### Graph Neural Networks



Points of Interest

YOW! Data 2018 Conference. https://www.youtube.com/watch?v=IDRb3CjESmM

#### **Popular Applications**

- **Object Classification/Detection**
- Media Processing/Generation
- Large-Scale Scientific Computing





https://giphy.com

**Tropical Cyclone Detection** https://insidehpc.com/2019/02/gordon-bell-prize-highlights-the-impact-of-ai/

#### **Designing Software 2.0**


Google shrinks language translation code from 500k LoC to 500

https://jack-clark.net/2017/10/09/import-ai-63-google-shrinks-language-translation-code-from-500000-to-500-lines-with-ai-only-25-of-surveyed-people-believe-automationbetter-jobs/

Kunle Olukotun, NeurIPS 2018 invited talk.

#### and more ...



Web page: aviral.lab.asu.edu


### **Dataflow Accelerators: Promising Solution**



prefetching, data distribution, data allocation.

- [1] Norman Jouppi et al. In-datacenter performance analysis of a tensor processing unit. In ISCA 2017.
- [2] Yu-Hsin Chen et al. Eyeriss: An energy-efficient reconfigurable accelerator for deep CNNs. In JSSC 2017.
- [3] Dataflow Processing Unit from Wave Computing. In HOTCHIPS 2017.
- [4] M. Thottethodi and T. N. Vijaykumar. Why the GPGPU is Less Efficient than the TPU for DNNs. ACM SIGARCH Blog, Jan 2019. (online)
- [5] Bruce Fleischer et al. A Scalable Multi-TeraOPS Core for AI Training and Inference. In VLSI 2018.
- [6] Manupa Karunaratne et al. HyCUBE: A CGRA with reconfigurable single-cycle multi-hop interconnect. In DAC 2017.



#### SCNN, NVIDIA

[Figure: a layer sequencer driving a grid of PEs, backed by off-chip DRAM]

### Our current focus in the system stack



# **Execution Modeling of Dataflow Accelerators**



Shail Dave, Youngbin Kim, Sasikanth Avancha, Kyoungwoo Lee, Aviral Shrivastava. dMazeRunner: Executing Perfectly Nested Loops on Dataflow Accelerators [CODES+ISSS, TECS 2019].

Features, with detailed modeling of:

- ✓ Analysis of arbitrary perfectly nested loops.
- ✓ Miss penalty and stall cycles (PE execution, managing PE/shared memory).
- ✓ Inter-PE communication.
- ✓ Temporal/spatial data reuse.
- ✓ Integrated support for common ML libraries (MXNet, Keras, TensorFlow, ...).

(thanks TVM! – leveraging front-end)

Step-wise equations and analysis in the paper

#### Validation of Dataflow Model against Eyeriss Chip



Chen, Yu-Hsin et al. "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks." [JSSC '17]

#### Validation against DNN Optimizer of Yang et al.

Yang, Xuan, M. Gao, J. Pu, A. Nayak, Q. Liu, S. Bell, J. Setter, K. Cao, H. Ha, Christos Kozyrakis, and Mark Horowitz. "DNN Dataflow Choice Is Overrated." [arXiv '18]



- Energy estimates differ by ~4.2% across a variety of execution methods.
- For efficient mappings, most of the energy is spent in RF accesses.



# **DiRAC: Microarch and Cycle-accurate Simulator**

#### Dynamically reconfigurable dataflow accelerator architecture template.

- (a)synchronous execution of pipelined PEs, double-buffered larger RFs
- **Programmable multicast network** for arbitrary dataflow
- 2D-mesh interconnect for fast inter-PE communications
- Multi-bank, conflict-free, software-directed scratchpad management
- Architectural template serves as a kick-starter baseline.
  Extend to support various interconnect/PEs/memory architecture

#### Cycle-accurate simulation of accelerator system.

- Explore FSM/ISA variations for PEs and controller, sensitivity analysis,...
- Work in progress. Release planned for the end of Q4 (Dec '19).
- Next step: FPGA emulation for functional testing + rapid prototyping.
- Develop + Integrate area/power model for comprehensive design exploration.

#### Outcomes

- Validation of optimizations achieved through analytical modeling.
- **Easy tool for prototyping** domain-specific accelerator architecture.
- **Educational/Training:** Tool for teaching and hands-on with ML accelerators.



Plan for Integration of DiRAC with dMazeRunner.



### **Our current focus in the system stack**



### **Spatio-Temporal Execution on dataflow accelerators**



# Vast "Execution Method" Space

- There are many ways to execute the nested loops of a DNN on a dataflow accelerator.
  - The design space spans both software and hardware.
  - Hardware: size, layout, and connectivity of PEs; SPM size; number of registers; NoC parameters; etc.
  - Software: loop mappings, e.g.,
    - Spatial: output stationary or row stationary.
    - Temporal: order and tiling of loops, data buffering, etc.
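To give a feel for how large the software side of this space is, the sketch below counts the ways to split each loop bound of ResNet conv5\_2 (the "Base" row of the mapping table later in the deck) into four tiling factors, one per level (SPATIAL, RF, SPM, DRAM), and multiplies by the loop orderings at the SPM and DRAM levels. It is an upper bound under stated assumptions: hardware capacity limits are ignored, and the helper names are illustrative rather than part of dMazeRunner.

```python
from math import comb, factorial


def prime_exponents(n):
    """Return the exponents in the prime factorization of n."""
    exps, p = [], 2
    while p * p <= n:
        e = 0
        while n % p == 0:
            n //= p
            e += 1
        if e:
            exps.append(e)
        p += 1
    if n > 1:
        exps.append(1)
    return exps


def ordered_factorizations(n, levels):
    """Count ways to write n as an ordered product of `levels` positive integers."""
    count = 1
    for e in prime_exponents(n):
        # Distribute e copies of one prime over `levels` slots (stars and bars).
        count *= comb(e + levels - 1, levels - 1)
    return count


# Loop bounds of ResNet conv5_2, taken from the "Base" row of the mapping table.
BOUNDS = {"N": 4, "M": 512, "C": 512, "Ox": 7, "Oy": 7, "Fx": 3, "Fy": 3}
LEVELS = 4  # SPATIAL, RF, SPM, DRAM

tilings = 1
for bound in BOUNDS.values():
    tilings *= ordered_factorizations(bound, LEVELS)

# Assumption: the seven loops can be ordered independently at the SPM and DRAM levels.
orderings = factorial(len(BOUNDS)) ** 2

print(f"tilings: {tilings:,}")
print(f"tilings x orderings: {tilings * orderings:,}")
```

Even for this single layer the unconstrained count is far too large to evaluate exhaustively, which is what motivates the pruning discussed later.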


#### 4D Convolution:



Conv5\_2 [ResNet]
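As a reference point, the convolution can be written as a 7-deep perfectly nested loop over n, m, c, oy, ox, fy, fx. The sketch below is a plain-Python version that assumes stride 1 and no padding; the function name and the tiny test tensors are illustrative only.

```python
def conv_loop_nest(ifmap, weights):
    """Direct convolution (stride 1, no padding) as a 7-deep perfectly nested loop.

    ifmap is indexed [n][c][iy][ix]; weights are indexed [m][c][fy][fx].
    Returns the output feature map [n][m][oy][ox] and the number of MACs executed.
    """
    N, C = len(ifmap), len(ifmap[0])
    Iy, Ix = len(ifmap[0][0]), len(ifmap[0][0][0])
    M = len(weights)
    Fy, Fx = len(weights[0][0]), len(weights[0][0][0])
    Oy, Ox = Iy - Fy + 1, Ix - Fx + 1

    ofmap = [[[[0.0] * Ox for _ in range(Oy)] for _ in range(M)] for _ in range(N)]
    macs = 0
    for n in range(N):
        for m in range(M):
            for c in range(C):
                for oy in range(Oy):
                    for ox in range(Ox):
                        for fy in range(Fy):
                            for fx in range(Fx):
                                ofmap[n][m][oy][ox] += (
                                    ifmap[n][c][oy + fy][ox + fx] * weights[m][c][fy][fx]
                                )
                                macs += 1
    return ofmap, macs


# Tiny all-ones example: N=2, C=3, 5x5 inputs, M=4 filters of size 3x3.
ones_ifmap = [[[[1.0] * 5 for _ in range(5)] for _ in range(3)] for _ in range(2)]
ones_weights = [[[[1.0] * 3 for _ in range(3)] for _ in range(3)] for _ in range(4)]
ofmap, macs = conv_loop_nest(ones_ifmap, ones_weights)
print(ofmap[0][0][0][0], macs)
```

With the conv5\_2 bounds from the mapping table (N=4, M=512, C=512, Ox=Oy=7, Fx=Fy=3) the same loop nest performs 4 × 512 × 512 × 7 × 7 × 3 × 3 = 462,422,016 MACs.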



## **Config#1: 1D Spatial Execution**





### **Config#2: 2D Spatial Execution**



## **Config#3: 3D Spatial Execution**



## Reordering of the loops $\rightarrow$ Different dataflow



### **Exploration of "execution methods"**



Each of the seven loops (n, m, c, ox, oy, fx, fy) appears once at every level; the levels nest from outermost to innermost:

- DRAM level (L3): `for n_L3 = 1:N_DRAM`, `for m_L3 = 1:M_DRAM`, `for c_L3 = 1:C_DRAM`, `for ox_L3 = 1:Ox_DRAM`, `for oy_L3 = 1:Oy_DRAM`, `for fx_L3 = 1:Fx_DRAM`, `for fy_L3 = 1:Fy_DRAM`, followed by `dma()`.
- SPM level (L2): `for n_L2 = 1:N_SPM`, `for m_L2 = 1:M_SPM`, `for c_L2 = 1:C_SPM`, `for ox_L2 = 1:Ox_SPM`, `for oy_L2 = 1:Oy_SPM`, `for fx_L2 = 1:Fx_SPM`, `for fy_L2 = 1:Fy_SPM`, followed by `communicate_data_NoC()`.
- RF level (L1): `for n_L1 = 1:N_RF`, `for m_L1 = 1:M_RF`, `for c_L1 = 1:C_RF`, `for ox_L1 = 1:Ox_RF`, `for oy_L1 = 1:Oy_RF`, `for fx_L1 = 1:Fx_RF`, `for fy_L1 = 1:Fy_RF`.
- Spatial level (S): `for n_S = 1:N_SPATIAL`, `for m_S = 1:M_SPATIAL`, `for c_S = 1:C_SPATIAL`, `for ox_S = 1:Ox_SPATIAL`, `for oy_S = 1:Oy_SPATIAL`, `for fx_S = 1:Fx_SPATIAL`, `for fy_S = 1:Fy_SPATIAL`.
- Loop body: the statement indexing `W[...]`.
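The four-level tiling can be checked on a single loop. The sketch below rebuilds the original iteration index from its DRAM, SPM, RF, and spatial tile indices and confirms that every iteration is visited exactly once. The index formula is an assumption about how the levels compose (outer levels vary slowest), and the tiling factors are those of the M loop in the dMazeRunner column of the mapping table.

```python
from itertools import product


def tiled_indices(t_dram, t_spm, t_rf, t_spatial):
    """Yield the original loop index rebuilt from the four per-level tile indices."""
    for i_l3, i_l2, i_l1, i_s in product(
        range(t_dram), range(t_spm), range(t_rf), range(t_spatial)
    ):
        # Mixed-radix composition: DRAM is the outermost level, SPATIAL the innermost.
        yield ((i_l3 * t_spm + i_l2) * t_rf + i_l1) * t_spatial + i_s


# M loop of ResNet conv5_2 (bound 512): DRAM=8, SPM=1, RF=16, SPATIAL=4.
indices = list(tiled_indices(t_dram=8, t_spm=1, t_rf=16, t_spatial=4))
print(len(indices), indices[:5])
```

Any choice of factors whose product equals the loop bound covers the same iteration space; what changes is where the data lives while each iteration runs.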


## **Drastic pruning of Search Space**

#### **Example: Generating loop-orderings with unique data reuse factors**
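One way to see why most loop orderings are redundant is to group them by which operand stays invariant across the innermost loops. The sketch below is a simplified illustration of that idea, not the algorithm from the paper: it assumes an operand is reused across a loop only if that loop does not index it, and it ignores sliding-window reuse of input feature maps. All names are illustrative.

```python
from itertools import permutations

LOOPS = ("n", "m", "c", "oy", "ox", "fy", "fx")

# Loops that index each operand of
# O[n][m][oy][ox] += I[n][c][oy+fy][ox+fx] * W[m][c][fy][fx].
INDEXED_BY = {
    "O": {"n", "m", "oy", "ox"},
    "I": {"n", "c", "oy", "ox", "fy", "fx"},
    "W": {"m", "c", "fy", "fx"},
}


def reuse_signature(order):
    """For each operand, the set of innermost loops across which it stays invariant."""
    signature = []
    for operand in ("O", "I", "W"):
        reused = set()
        for loop in reversed(order):  # walk from the innermost loop outwards
            if loop in INDEXED_BY[operand]:
                break
            reused.add(loop)
        signature.append(frozenset(reused))
    return tuple(signature)


# Keep one representative ordering per distinct reuse signature.
representatives = {}
total = 0
for order in permutations(LOOPS):
    total += 1
    representatives.setdefault(reuse_signature(order), order)

print(total, len(representatives))
```

Under these assumptions the 5,040 orderings of the seven loops collapse into 15 groups, so only one ordering per group needs to be evaluated.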



## **Results: 9X reduction in EDP**



## **Adaptable Mappings Yield Better Results**

- Adapts to kernel/architecture characteristics
  - Scales for layers/tensors of different shapes
- Finds non-intuitive mappings that optimize several factors, e.g.,
  - ✓ High resource utilization
  - ✓ Maximized reuse of multiple data operands
  - ✓ Minimized DRAM accesses
  - ✓ Efficient interleaving of computation with communication latency

- [1] S. Gupta et al. Deep learning with limited numerical precision. In ICML 2015.
- [2] Y. Chen et al. Eyeriss: A spatial architecture for energy-efficient dataflow for CNNs. In ISCA 2016.

#### Example Mappings of ResNet Conv5\_2 for Output-Stationary Dataflow

|                | MOC |     |     |    |    |    |    | dMazeRunner |     |     |    |    |    |    |
|----------------|-----|-----|-----|----|----|----|----|-------------|-----|-----|----|----|----|----|
| Tiling factors | N   | M   | C   | Ox | Oy | Fx | Fy | N           | M   | C   | Ox | Oy | Fx | Fy |
| SPATIAL        | 1   | 4   | 1   | 7  | 7  | 1  | 1  | 1           | 4   | 1   | 7  | 7  | 1  | 1  |
| RF             | 1   | 2   | 8   | 1  | 1  | 3  | 3  | 4           | 16  | 1   | 1  | 1  | 3  | 3  |
| SPM            | 2   | 2   | 8   | 1  | 1  | 1  | 1  | 1           | 1   | 8   | 1  | 1  | 1  | 1  |
| DRAM           | 2   | 32  | 8   | 1  | 1  | 1  | 1  | 1           | 8   | 64  | 1  | 1  | 1  | 1  |
| Base           | 4   | 512 | 512 | 7  | 7  | 3  | 3  | 4           | 512 | 512 | 7  | 7  | 3  | 3  |
| L2_Order       | {n_L2, m_L2, oy_L2, ox_L2, fy_L2, fx_L2, c_L2} (outer to inner) | | | | | | | {n_L2, m_L2, oy_L2, ox_L2, fy_L2, fx_L2, c_L2} (outer to inner) | | | | | | |
| L3_Order       | {n_L3, m_L3, oy_L3, ox_L3, fy_L3, fx_L3, c_L3} (outer to inner) | | | | | | | {n_L3, m_L3, oy_L3, ox_L3, fy_L3, fx_L3, c_L3} (outer to inner) | | | | | | |

MOC: Simultaneous spatial processing of Multiple Output Channels [1, 2]

| For data allocated in RFs of PEs                   | MOC         | dMazeRunner |
|----------------------------------------------------|-------------|-------------|
| **PE compute vs. data comm. latency**              | 144 vs. 648 | 576 vs. 576 |
| Total cycles                                       | ~10,616,832 | ~2,459,648  |
| Ideal execution cycles for output-stationary       | 2,359,296   | 2,359,296   |
| Reduction in DRAM accesses (ifmaps, weights)       | (1x, 1x)    | (4.57x, 2x) |
| **Perf. improvement** (normalized to MOC)          | 1x          | 4.44x       |
| Energy-Delay-Product reduction (normalized)        | 1x          |             |
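Several entries in the two tables can be cross-checked from the tiling factors alone. The sketch below verifies that each loop bound equals the product of its four tiling factors, then derives the number of PEs used, the compute cycles per RF-level pass, and the ideal cycle count. The last step, `passes * max(compute, communication)`, is a simplifying assumption (perfect overlap of compute with data communication, no other stalls) rather than the model from the paper; the communication latencies are copied from the table.

```python
from math import prod

DIMS = ("N", "M", "C", "Ox", "Oy", "Fx", "Fy")
BASE = (4, 512, 512, 7, 7, 3, 3)

MAPPINGS = {
    "MOC": {
        "SPATIAL": (1, 4, 1, 7, 7, 1, 1),
        "RF": (1, 2, 8, 1, 1, 3, 3),
        "SPM": (2, 2, 8, 1, 1, 1, 1),
        "DRAM": (2, 32, 8, 1, 1, 1, 1),
    },
    "dMazeRunner": {
        "SPATIAL": (1, 4, 1, 7, 7, 1, 1),
        "RF": (4, 16, 1, 1, 1, 3, 3),
        "SPM": (1, 1, 8, 1, 1, 1, 1),
        "DRAM": (1, 8, 64, 1, 1, 1, 1),
    },
}

# Data communication latency per RF-level pass, as reported in the table.
COMM_LATENCY = {"MOC": 648, "dMazeRunner": 576}


def summarize(name):
    mapping = MAPPINGS[name]
    # Every loop bound must equal the product of its tiling factors.
    for i, bound in enumerate(BASE):
        assert prod(level[i] for level in mapping.values()) == bound, DIMS[i]
    pes = prod(mapping["SPATIAL"])           # PEs working in parallel
    compute = prod(mapping["RF"])            # MACs per PE per RF-level pass
    passes = prod(mapping["SPM"]) * prod(mapping["DRAM"])
    ideal = prod(BASE) // pes                # cycles with no stalls at all
    # Assumed stall model: each pass costs the slower of compute and communication.
    overlapped = passes * max(compute, COMM_LATENCY[name])
    return {"pes": pes, "compute": compute, "passes": passes,
            "ideal": ideal, "overlapped": overlapped}


for name in MAPPINGS:
    print(name, summarize(name))
```

For MOC this simple model reproduces the reported total of 10,616,832 cycles, because communication (648 cycles) dominates compute (144 cycles) in every pass. For the dMazeRunner mapping the two latencies are balanced, so the model returns the ideal 2,359,296 cycles, slightly below the reported ~2,459,648.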

# **Achieving Close-to-Optimal Solutions in Seconds**

- Even non-experts in the domain can explore the space:

  `python run_optimizer.py --frontend mxnet --model resnet18 --layer-index 0`

- **Does not preclude experts**/programmers from directing the search.
  - Built-in support for a few common optimization strategies.
- Quick exploration:
  - EDP is ~2% higher than the optimum found by brute-force search (seconds vs. hours/days).
  - The implementation is multi-threaded and caches commonly invoked routines of the analytical model.
  - Enables effective DSE of the architecture.

[Alpha Release] https://github.com/cmlasu/dMazeRunner

Search-space exploration on an Intel i7-6700 quad-core CPU:

- min: 1 second, ResNet conv5\_2 (753 methods)
- max: 122 seconds, ResNet conv2\_2 (122,092 methods)

[dMazeRunner, CODES+ISSS '19]

Optimizing Memory Sizes for ResNet18 Layers DSE for 256-PE CGRA



# **Summary and Next Steps**

- Coarse-grained dataflow accelerators are promising for accelerating ML models.
  - Challenge: Programming the accelerators
  - System stack can extend the applicability.
- dMazeRunner App Mapping Framework
  - Analytical Power and performance model
  - Automated Design Space Exploration
- End-to-end system [WIP]
  - Programmable Microarch + Simulator
  - FPGA emulation
- Further Opportunities
  - Sparsity [WIP]: Support dynamic sparsity of varying levels (inference + training).
  - Multi-chip module accelerations
- Exciting times ahead!





