Switch ML Application

SwitchML: Switch-Based Training Acceleration for Machine Learning

SwitchML accelerates the Allreduce communication primitive commonly used by distributed Machine Learning frameworks. It uses a programmable switch dataplane to perform in-network computation, reducing the volume of exchanged data by aggregating vectors (e.g., model updates) from multiple workers in the network. It provides an end-host library that can be integrated with ML frameworks to provide an efficient solution that speeds up training for a number of real-world benchmark models.
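To make the aggregation idea concrete, here is a minimal conceptual sketch in Python of what an in-network Allreduce computes: the element-wise sum of the workers' vectors, processed in fixed-size chunks as if each chunk travelled in one packet. The function name `switch_aggregate`, the `chunk_size` parameter, and the sample data are assumptions made for illustration; this is not the SwitchML API and it does not model the P4 dataplane.

```python
from typing import List


def switch_aggregate(worker_vectors: List[List[int]], chunk_size: int = 4) -> List[int]:
    """Conceptual model only: element-wise sum of equal-length vectors.

    The vectors are processed chunk by chunk, as if every chunk were one
    packet that the switch aggregates across all workers.
    """
    if not worker_vectors:
        raise ValueError("need at least one worker vector")
    length = len(worker_vectors[0])
    if any(len(v) != length for v in worker_vectors):
        raise ValueError("all worker vectors must have the same length")

    result: List[int] = []
    for start in range(0, length, chunk_size):
        # One "packet" per worker for this slice of the vector.
        chunks = [v[start:start + chunk_size] for v in worker_vectors]
        # The switch adds the values that share the same position.
        result.extend(sum(values) for values in zip(*chunks))
    return result


# Hypothetical model updates from three workers.
updates = [
    [1, 2, 3, 4, 5],
    [10, 20, 30, 40, 50],
    [100, 200, 300, 400, 500],
]
# Every worker would receive this same aggregated vector back.
aggregated = switch_aggregate(updates)
```

In this model each worker sends its vector once and receives one aggregated vector, instead of exchanging partial results with the other workers.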

The switch hardware is programmed with a P4 program for the Tofino Native Architecture (TNA) and managed at runtime through a Python controller using BFRuntime. The end-host library provides simple APIs to perform Allreduce operations using different transport protocols. We currently support UDP through DPDK and RDMA UC. The library has already been integrated with ML frameworks as an NCCL plugin.

Getting started

To run SwitchML you need to:

The examples folder provides simple programs that show how to use the APIs.

Repo organization

The SwitchML repository is organized as follows:

docs: project documentation
dev_root:
  ┣ p4: P4 code for TNA
  ┣ controller: switch controller program
  ┣ client_lib: end-host library
  ┣ examples: set of example programs
  ┣ benchmarks: programs used to test raw performance
  ┣ frameworks_integration: code to integrate with ML frameworks
  ┣ third_party: third party software
  ┣ protos: protobuf description for the interface between controller and end-host
  ┗ scripts: helper scripts

Testing

The benchmarks folder contains a benchmark program that we used to measure SwitchML performance. In our experiments (see benchmark documentation for details) we observed a more than 2x speedup over NCCL when using RDMA. Moreover, unlike ring Allreduce, SwitchML performance stays constant regardless of the number of workers.
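One way to see why the number of workers matters for ring Allreduce but not for in-network aggregation is a back-of-the-envelope communication model. The sketch below uses the standard ring Allreduce cost (2(n-1) sequential steps, each moving 1/n of the vector) and an idealized in-network model in which each worker sends its vector once and receives the aggregate once. The function names are hypothetical, and the model ignores packet headers, latency, and switch memory limits; it is not a substitute for the measured results.

```python
from typing import Tuple


def ring_allreduce_cost(num_workers: int, size_bytes: int) -> Tuple[float, int]:
    """Return (bytes sent per worker, sequential communication steps).

    Standard ring Allreduce: a reduce-scatter phase followed by an
    all-gather phase, each taking (n - 1) steps of size_bytes / n.
    """
    if num_workers < 2:
        raise ValueError("ring Allreduce needs at least two workers")
    steps = 2 * (num_workers - 1)
    bytes_sent = steps * size_bytes / num_workers
    return bytes_sent, steps


def in_network_cost(num_workers: int, size_bytes: int) -> Tuple[float, int]:
    """Idealized assumption: each worker sends its vector to the switch once.

    The result does not depend on num_workers; the parameter is kept only
    so that both models share the same signature.
    """
    if num_workers < 1:
        raise ValueError("need at least one worker")
    return float(size_bytes), 1


ring_small = ring_allreduce_cost(4, 1000)
ring_large = ring_allreduce_cost(64, 1000)
switch_small = in_network_cost(4, 1000)
switch_large = in_network_cost(64, 1000)
```

Under these assumptions the ring cost grows in both steps and bytes as workers are added, while the in-network cost is the same for 4 and for 64 workers.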

[Figure: Benchmarks]

Publication

Scaling Distributed Machine Learning with In-Network Aggregation. A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy, M. Moshref, D. R. K. Ports, P. Richtarik. In Proceedings of NSDI’21, Apr 2021.

Contributing

This project welcomes contributions and suggestions. To learn more about making a contribution to SwitchML, please see our Contribution page.

The Team

SwitchML is a project driven by the P4.org community and is currently maintained by Amedeo Sapio, Omar Alama, Marco Canini, and Jacob Nelson.

License

SwitchML is released under the Apache License 2.0, as found in the LICENSE file.
