# Taurus: An Intelligent Data Plane

#### Muhammad Shahbaz

Tushar Swamy, Alex Rucker, and Kunle Olukotun




Figure: Programmable data plane

## Managing Networks is Hard!













#### *Slow* but *intelligent*



#### Examples:

- Congestion control
- Load balancing (ECMP, RSS)
- Queue scheduling
- and more

#### Characteristics:

- Operates on packets or flowlets (*i.e.*, bursts of packets)
- Uses heuristics ... hash, etc.
- Low latency ... $\leq$ sub-$\mu$s
- High throughput ... Tbps
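The hash-based heuristic mentioned above can be illustrated with a minimal sketch of ECMP-style path selection. The function name `ecmp_select`, the choice of CRC32, and the example flow are assumptions made for illustration; real switches use their own hardware hash functions.

```python
import zlib

def ecmp_select(five_tuple, num_paths):
    """Pick an egress path by hashing the flow's 5-tuple (illustrative only)."""
    # Serialize the fields so that every packet of a flow yields the same key.
    key = "|".join(str(field) for field in five_tuple).encode()
    # CRC32 stands in for a hardware hash; the modulo maps it onto a path index.
    return zlib.crc32(key) % num_paths

# Hypothetical flow: (src IP, dst IP, protocol, src port, dst port)
flow = ("10.0.0.1", "10.0.0.2", 6, 12345, 80)
path = ecmp_select(flow, num_paths=4)
```

Because the choice depends only on header fields, it is cheap enough to run per packet, but it cannot adapt to what the network is actually experiencing.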





#### *Slow* but *intelligent*






#### Examples:

- Anomaly detection
- Automation
- Recommendation

#### Characteristics:

- Operates on flows
- Performs complicated tasks
- Sub-second latency
- Low throughput

#### Control Plane Intelligence

#### *Slow* but *intelligent*

Data Plane Switch or NIC






Fast and intelligent

#### Taurus: An Intelligent Data Plane



# What does "intelligence" mean?

- Networks are becoming autonomous (*self-driving networks*).
- Machine learning (ML) will play a key role in the future of networks.





Figure: Neural network with an input layer, hidden layers, and an output layer









- Today's programmable data plane is not suitable for learned operations:
  - Arithmetic intensity is too low to perform ML operations
  - Not enough intermediate storage to carry feature and state
  - and more ...







## Taurus: An Intelligent Data Plane






- Implements a spatial SIMD architecture

#### Taurus: Map Reduce Block





# Map Reduce Block: Compute Unit (CU)

- Taurus **CUs** are array-based:
  - Functional Units (FUs)
  - Pipeline Registers (PRs)






Reduction network condenses vectors to scalars
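As a rough software analogy for the map and reduce steps, the sketch below computes one neuron's weighted sum: an elementwise multiply (map) followed by a pairwise tree reduction that condenses the vector to a scalar. The function names and example values are hypothetical; the actual CU performs these steps in hardware across its functional units.

```python
def map_stage(weights, features):
    """Map: multiply each feature by its weight, one lane per element."""
    return [w * x for w, x in zip(weights, features)]

def reduce_stage(values):
    """Reduce: pairwise (tree) summation that condenses a vector to a scalar."""
    values = list(values)
    if not values:
        return 0.0
    while len(values) > 1:
        if len(values) % 2:          # pad odd-length levels with a neutral element
            values.append(0.0)
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

# Hypothetical weights and features for a single neuron.
weights = [0.5, -1.0, 2.0, 1.0]
features = [2.0, 1.0, 0.5, 3.0]
products = map_stage(weights, features)
weighted_sum = reduce_stage(products)
activation = max(0.0, weighted_sum)  # ReLU, chosen here only as an example
```

A tree reduction takes a number of levels logarithmic in the vector length, which is one reason this pattern suits a pipelined array of functional units.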




## **Example: Anomaly Detection**




1. Parse packets and **read local** features (e.g., IP address)
2. Retrieve **out-of-network** events (e.g., failed logins per IP)
3. Apply learned functions to mark anomalous packets
4. Select a port or **action** (drop if score == 1)
5. Send packets out the selected port
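The stages above can be sketched end to end in software. Everything named here is hypothetical: the `FAILED_LOGINS` table, the single-feature linear model with `WEIGHT` and `BIAS`, and the packet representation are stand-ins chosen for illustration, not the model used in Taurus.

```python
# Hypothetical out-of-network events: failed logins observed per source IP.
FAILED_LOGINS = {"10.0.0.9": 25}

# Hypothetical learned parameters of a one-feature linear classifier.
WEIGHT, BIAS = 1.0, -10.0

def parse(packet):
    """Parse the packet and read local features."""
    return {"src_ip": packet["src_ip"]}

def lookup(features):
    """Retrieve out-of-network events for this source."""
    features["failed_logins"] = FAILED_LOGINS.get(features["src_ip"], 0)
    return features

def score(features):
    """Apply the learned function: 1 marks the packet as anomalous."""
    return 1 if WEIGHT * features["failed_logins"] + BIAS > 0 else 0

def select_action(anomaly_score, default_port):
    """Select an action: drop if score == 1, otherwise forward."""
    return "drop" if anomaly_score == 1 else default_port

def process(packet, default_port=1):
    """Run one packet through all stages and return the chosen port or action."""
    return select_action(score(lookup(parse(packet))), default_port)

suspicious = process({"src_ip": "10.0.0.9"})
benign = process({"src_ip": "10.0.0.1"})
```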

# Evaluation: Anomaly Detection in Switches

- Taurus examines every packet at line rate



- Added latency is less than port-to-port latency

| Model | Throughput | Latency       | Area (+%) | Power (+%) |
|-------|------------|---------------|-----------|------------|
| SVM   | 1 GPkt/s   | **68** ns     | 6.1       | 1.1        |
| DNN   | 1 GPkt/s   | **362** ns    | 11.7      | 2.0        |

\*Overheads are calculated relative to a 300 mm<sup>2</sup> chip with 4 reconfigurable pipelines, each drawing an estimated 25 W
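One way to read the throughput and latency columns together: if the block is fully pipelined (an assumption, not something the table states), the number of packets inside it at any instant is throughput multiplied by latency. The helper name below is hypothetical.

```python
def in_flight_packets(throughput_gpkt_per_s, latency_ns):
    """Packets resident in a fully pipelined block (throughput x latency).

    1 GPkt/s is one packet per nanosecond, so the units cancel directly.
    """
    return throughput_gpkt_per_s * latency_ns

svm_in_flight = in_flight_packets(1, 68)    # SVM row of the table
dnn_in_flight = in_flight_packets(1, 362)   # DNN row of the table
```

Under that assumption, sustaining line rate means overlapping tens to hundreds of packets in the pipeline rather than finishing one before starting the next.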

# Evaluation: Congestion Control at the NICs

- Indigo is a congestion control LSTM network



- Taurus updates every 12.5 ns (software updates every 10 ms)

| Model | Throughput         | Latency       | Area (+%) | Power (+%) |
|-------|--------------------|---------------|-----------|------------|
| LSTM  | **0.08** GPkt/s    | **380** ns    | 23.6      | 4.1        |

\*Overheads are calculated relative to a 300 mm<sup>2</sup> chip with 4 reconfigurable pipelines, each drawing an estimated 25 W
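The 12.5 ns update interval appears consistent with the table's throughput: one decision per packet at 0.08 GPkt/s. The short calculation below checks that reading and compares it with the 10 ms software interval; the variable names are illustrative.

```python
# 0.08 GPkt/s expressed in packets per second.
throughput_pkt_per_s = 80_000_000

# Time between consecutive packets (and hence model updates), in nanoseconds.
taurus_interval_ns = 1_000_000_000 / throughput_pkt_per_s

# Software update interval from the slide: 10 ms, expressed in nanoseconds.
software_interval_ns = 10 * 1_000_000

# How many Taurus updates fit in one software update interval.
speedup = software_interval_ns / taurus_interval_ns
```

This works out to an interval of 12.5 ns and a ratio of 800,000 updates per software update.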





Fast and intelligent

- Designed to **run machine-learning inference** inside a data plane
- Provides orders-of-magnitude improvement over existing approaches (e.g., congestion-control updates every 12.5 ns versus every 10 ms in software)










Muhammad Shahbaz

http://cs.stanford.edu/~mshahbaz

# Backup slides ...



- The parser implements a **finite state machine (FSM)** that operates on a user-defined **parse graph**
- Converts the incoming packet bit stream into vectors, e.g.,
  - headers (IP or TCP)
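A parse graph can be pictured as a table of state transitions keyed by a header field. The sketch below is a simplified software analogy, assuming an Ethernet, IPv4, and TCP graph; the `PARSE_GRAPH` layout and the `parse` function are hypothetical and cover only the fields needed for the example.

```python
import struct

# Hypothetical parse graph: current state -> {selector value: next state}.
PARSE_GRAPH = {
    "ethernet": {0x0800: "ipv4"},  # EtherType 0x0800 selects IPv4
    "ipv4": {6: "tcp"},            # IP protocol 6 selects TCP
    "tcp": {},                     # leaf state
}

def parse(packet: bytes):
    """Walk the parse graph and extract header fields into a dict."""
    headers, state, offset = {}, "ethernet", 0
    while state:
        key = None
        if state == "ethernet":
            (key,) = struct.unpack_from("!H", packet, offset + 12)
            headers["ethernet"] = {"ethertype": key}
            offset += 14
        elif state == "ipv4":
            header_len = (packet[offset] & 0x0F) * 4   # IHL is in 32-bit words
            key = packet[offset + 9]
            src = ".".join(str(b) for b in packet[offset + 12:offset + 16])
            dst = ".".join(str(b) for b in packet[offset + 16:offset + 20])
            headers["ipv4"] = {"protocol": key, "src": src, "dst": dst}
            offset += header_len
        elif state == "tcp":
            sport, dport = struct.unpack_from("!HH", packet, offset)
            headers["tcp"] = {"sport": sport, "dport": dport}
        # Follow the edge selected by the field just read; stop if there is none.
        state = PARSE_GRAPH[state].get(key)
    return headers

# Hand-built example packet (all other fields zeroed).
eth = bytes(12) + b"\x08\x00"
ip = bytes([0x45]) + bytes(8) + bytes([6]) + bytes(2) + bytes([10, 0, 0, 1, 10, 0, 0, 2])
tcp = struct.pack("!HH", 12345, 80) + bytes(16)
parsed = parse(eth + ip + tcp)
```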





- A match-action table:
  - Memory for exact (SRAM) and ternary (TCAM) match
  - ALU for basic single-cycle VLIW operations (no loops or multiplication)
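The difference between exact and ternary matching can be shown with a small software analogy. The table contents, action tuples, and function names are hypothetical; in hardware, the exact table would live in SRAM and the ternary table in TCAM, where all entries are compared in parallel.

```python
# Exact match: the whole key must be equal (a dict stands in for SRAM hashing).
EXACT_TABLE = {0x0A000002: ("forward", 3)}          # 10.0.0.2 -> port 3

# Ternary match: (value, mask, action), listed in priority order.
# Bits where the mask is 0 are "don't care".
TERNARY_TABLE = [
    (0x0A000000, 0xFF000000, ("forward", 1)),       # 10.0.0.0/8 -> port 1
    (0x00000000, 0x00000000, ("drop", None)),       # wildcard default
]

def exact_match(key):
    """Return the action for an exactly matching key, or None on a miss."""
    return EXACT_TABLE.get(key)

def ternary_match(key):
    """Return the action of the first (highest-priority) matching entry."""
    for value, mask, action in TERNARY_TABLE:
        if key & mask == value & mask:
            return action
    return None

hit_exact = exact_match(0x0A000002)
hit_prefix = ternary_match(0x0A010203)    # 10.1.2.3 falls in 10.0.0.0/8
hit_default = ternary_match(0xC0A80001)   # 192.168.0.1 hits the wildcard
```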



- Responsible for **storing** and **forwarding** packets off the chip:
  - **Queuing**: buffer incoming packets
  - **Replication**: clone packets across multiple egress ports (*e.g.*, multicast)
  - **Scheduling**: forward packets based on a queuing discipline (*e.g.*, PIFO) or instructions from the match-action tables
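For the PIFO (push-in first-out) discipline named above, a minimal software sketch: packets are pushed in at a position determined by a rank and always dequeued from the head. The class below is an illustration built on a binary heap, not a description of the hardware design; the rank values are hypothetical.

```python
import heapq
import itertools

class PIFO:
    """Push-in first-out queue: enqueue by rank, dequeue from the head."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker keeps equal ranks FIFO

    def push(self, rank, packet):
        """Insert a packet; a lower rank is scheduled earlier."""
        heapq.heappush(self._heap, (rank, next(self._arrival), packet))

    def pop(self):
        """Remove and return the packet at the head of the queue."""
        _, _, packet = heapq.heappop(self._heap)
        return packet

    def __len__(self):
        return len(self._heap)

queue = PIFO()
queue.push(2, "b")
queue.push(1, "a")
queue.push(2, "c")
order = [queue.pop() for _ in range(len(queue))]
```

Different scheduling policies then reduce to different ways of computing the rank before the push.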