  • Stars: 121
  • Rank: 292,239 (Top 6%)
  • Language: Python
  • Created: about 4 years ago
  • Updated: over 3 years ago


Repository Details

Research and development for optimizing transformers

Substation: Optimized Transformers ⚑

Substation is a project to optimize transformers using data movement analysis.

This code is presently at a research-and-development stage. We are actively working to make it both faster and more usable.

For more background, please see our paper, Data Movement Is All You Need: A Case Study on Optimizing Transformers. If you use our code, please cite the paper:

@article{ivanov2020data,
  title={Data Movement Is All You Need: A Case Study on Optimizing Transformers},
  author={Ivanov, Andrei and Dryden, Nikoli and Ben-Nun, Tal and Li, Shigang and Hoefler, Torsten},
  journal={arXiv preprint arXiv:2007.00072},
  year={2020}
}

Current Performance

We presently include configurations for two versions of a single BERT-large encoder layer:

  1. Batch size 8 and max sequence length 512.
  2. Batch size 96 and max sequence length 128.

These benchmarks were run on the Lassen supercomputer. Note that the NVIDIA V100s in this system are the SXM2 variant, with a peak of 125 TFLOP/s using Tensor Cores. We compare against the same transformer architecture implemented in TensorFlow (with XLA), PyTorch, and DeepSpeed. These results use the latest version of our code; see our paper for further details.

All times are in milliseconds (ms).

BERT-large, batch size 8, max sequence length 512 runtime

PyTorch   TensorFlow+XLA   DeepSpeed   Substation
9.14      8.4              7.6         6.71

BERT-large, batch size 96, max sequence length 128 runtime

PyTorch   TensorFlow+XLA   DeepSpeed   Substation
18.43     n/a              16.19       15.42
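For context, the tables translate into the following relative speedups. This is a small standalone calculation over the numbers above, not part of the repository:

```python
# Runtimes in ms, taken from the two tables above (the n/a entry is omitted).
runtimes = {
    "batch 8, seq 512": {"PyTorch": 9.14, "TensorFlow+XLA": 8.4,
                         "DeepSpeed": 7.6, "Substation": 6.71},
    "batch 96, seq 128": {"PyTorch": 18.43, "DeepSpeed": 16.19,
                          "Substation": 15.42},
}

# Print Substation's speedup over each baseline framework.
for config, times in runtimes.items():
    ours = times["Substation"]
    for framework, ms in times.items():
        if framework != "Substation":
            print(f"{config}: {ms / ours:.2f}x vs. {framework}")
```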

Usage

Note: We are actively working to improve the usability for standard deep learning workflows.

Our encoder implementation is available as a PyTorch module in pytorch_module/encoder.py. Whenever you create a Substation encoder, you must specify an associated set of layouts and other configuration options (see below for generating these yourself). We provide the configurations used for the two BERT-large versions above as layouts-bert-b8-l512-h16-e1024.pickle and layouts-bert-b96-l128-h16-e1024.pickle, respectively. These configurations are tuned for those specific problem sizes and that hardware, but should still run for other problem sizes and on other hardware. The underlying optimized implementation for the encoder is generated and compiled the first time you use it.
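The shipped layout filenames encode their problem sizes. Reading them against the BERT-large settings above, b appears to be the batch size, l the max sequence length, h the number of attention heads, and e the embedding size. A small illustrative decoder for this naming scheme (our helper, not part of Substation):

```python
import re

def decode_layout_name(filename: str) -> dict:
    # Decode a layout-pickle filename such as
    # "layouts-bert-b8-l512-h16-e1024.pickle" into its parameters.
    # The field meanings (batch, seq length, heads, embedding) are our
    # reading of the naming scheme, not documented by the repository.
    m = re.match(r"layouts-(\w+)-b(\d+)-l(\d+)-h(\d+)-e(\d+)\.pickle", filename)
    if m is None:
        raise ValueError(f"unrecognized layout filename: {filename}")
    model, b, l, h, e = m.groups()
    return {"model": model, "batch": int(b), "seq_len": int(l),
            "heads": int(h), "embed": int(e)}

params = decode_layout_name("layouts-bert-b8-l512-h16-e1024.pickle")
```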

For performance benchmarking, we provide the run_encoder.py script. See its --help information for details.

Generating New Configurations

If you want to get the best performance for your particular problem configuration and/or hardware, you will need to generate a configuration. This involves two phases: benchmarking to gather performance data, then configuration selection.

Benchmarking

Warning: This can take a long time.

This exhaustively benchmarks the possible layouts (and other options) for every operator used in the encoder layer. There are two sets of benchmarks, one for tensor contractions (which uses cuBLAS) and one for our custom fused kernel implementations.

Tensor Contractions

These are located in tc_profiling.

  1. Run compile.sh to build cuBLAS benchmarks.
  2. Run einsum_perms.py (e.g., einsum_perms.py --b 8 --j 512 --h 16 --i 1024) to generate the benchmark configurations for each operator.
  3. These configurations can be run with runprof.py <config file>.

Fused Kernels

These are run with the pytorch_module/benchmark.py script. You specify the kernel to benchmark with --kernel name. By default, this uses the batch size 8, sequence length 512 configuration of BERT-large. You can change the size using the --size argument. For example:

python benchmark.py --kernel softmax --size "H=16,B=96,J=128,K=128,U=4096,N=1024,P=64"

See its --help for more arguments.
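The --size string is a comma-separated list of dimension=value pairs. Matching the BERT-large settings above, B and J/K appear to be the batch size and sequence lengths, H the number of heads, N the embedding size, U the feed-forward size, and P the per-head size (our reading, not documented here). A tiny illustrative parser for the format, not the one benchmark.py itself uses:

```python
def parse_size(spec: str) -> dict:
    # Split a --size string like "H=16,B=96,..." into {"H": 16, "B": 96, ...}.
    # Illustrative helper only; benchmark.py has its own argument parsing.
    return {key: int(value)
            for key, value in (item.split("=") for item in spec.split(","))}

sizes = parse_size("H=16,B=96,J=128,K=128,U=4096,N=1024,P=64")
print(sizes["B"], sizes["J"])  # batch size and sequence length
```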

You will need to run every tensor contraction and kernel benchmark.

Configuration Selection

These scripts are located in the config_selection directory. First, collect the benchmark data into a single directory: the kernel benchmark output can be copied directly, while the tensor contraction results must first be assembled with the parse_tc_results.py script and then copied into the same directory.

Final configuration selection can then be run with python optimize.py --output_config my_layouts.pickle results-dir.

Advanced

The optimize.py script can use several strategies for configuration selection, controlled with the --graph_order argument. The default, bp_first, optimizes the encoder layer's backpropagation pass first, then its forward pass. fp_first optimizes forward propagation first, then backpropagation. bp_first typically yields faster configurations than fp_first. The third option, combined, optimizes forward and backpropagation simultaneously and typically yields the fastest configurations. However, this approach is somewhat finicky and can often fail to find a valid layout. You can work around this by telling the optimizer to "split" at certain variables using the --split_vars argument.

The layouts-bert-b8-l512-h16-e1024.pickle configuration was generated using optimize.py --graph_order combined --split_vars X LN1 LN2 LIN2 DLIN2. The layouts-bert-b96-l128-h16-e1024.pickle configuration was generated using optimize.py --graph_order combined --split_vars X DROP2 LN1.

Contributors

This project is led by the Scalable Parallel Computing Lab at ETH Zurich.

See also the list of contributors.
