  • Stars: 776
  • Rank: 58,561 (Top 2%)
  • Language: Python
  • License: BSD 3-Clause
  • Created: over 5 years ago
  • Updated: about 4 years ago

Repository Details

A GPipe implementation in PyTorch

torchgpipe

A GPipe implementation in PyTorch. It is optimized for CUDA rather than TPU.

from torch import nn
from torchgpipe import GPipe

model = nn.Sequential(a, b, c, d)  # a, b, c, d are arbitrary nn.Module layers
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)
output = model(input)

What is GPipe?

GPipe is a scalable pipeline parallelism library published by Google Brain, which allows efficient training of large, memory-consuming models. According to the paper, GPipe can train a model 25× larger by using 8× more devices (TPUs), and train a model 3.5× faster by using 4× more devices.

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism

Google trained AmoebaNet-B with 557M parameters using GPipe. This model achieved 84.3% top-1 and 97.0% top-5 accuracy on the ImageNet classification benchmark (the state-of-the-art performance as of May 2019).

GPipe uses (a) pipeline parallelism and (b) automatic recomputation of the forward propagation during backpropagation, which together make it possible to train large models. We refer to (b) as checkpointing, following the well-known terminology in the PyTorch community.

Pipeline Parallelism
GPipe splits a model into multiple partitions and places each partition on a different device so that the model can use more memory capacity. It also splits a mini-batch into multiple micro-batches so that the partitions work in parallel as much as possible.
Checkpointing
Checkpointing is applied to each partition to minimize the overall memory consumption of the model. During forward propagation, only the tensors at the boundaries between partitions are stored. All other intermediate tensors are discarded and recomputed during backpropagation when necessary.
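
The checkpointing behavior can be illustrated with PyTorch's built-in torch.utils.checkpoint utility, which drops the intermediate activations of a wrapped segment during the forward pass and recomputes them during backpropagation. This is only a conceptual sketch with a hypothetical partition, not torchgpipe's internal implementation:

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# A hypothetical partition. Activations produced inside this block are not
# stored during the forward pass; they are recomputed during backward.
partition = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128))

x = torch.randn(8, 128, requires_grad=True)
y = checkpoint(partition, x)  # only the boundary tensors x and y are kept
y.sum().backward()            # the forward pass of `partition` is recomputed here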

Usage

Currently, torchgpipe requires the following environment:

  • Python 3.6+
  • PyTorch 1.1+

To use torchgpipe, install it via PyPI:

$ pip install torchgpipe

To train a module with GPipe, simply wrap it with torchgpipe.GPipe. The module must be an nn.Sequential, as GPipe automatically splits it into partitions of consecutive layers. The balance argument determines the number of layers in each partition, and the chunks argument specifies the number of micro-batches. Input, output, and intermediate tensors must be Tensor or Tuple[Tensor, ...].

The example code below shows how to split a module with four layers into four partitions, each having a single layer. It also splits a mini-batch into 8 micro-batches:

from torch import nn
from torchgpipe import GPipe

# a, b, c, d are the four consecutive layers of the module.
model = nn.Sequential(a, b, c, d)
# One layer per partition; each mini-batch is split into 8 micro-batches.
model = GPipe(model, balance=[1, 1, 1, 1], chunks=8)

for input in data_loader:
    output = model(input)
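
Building on this example, the following is a hedged sketch of a complete training step. The criterion, optimizer, and target are illustrative additions not present in the snippet above, the data loader is assumed to yield (input, target) pairs, and the sketch assumes GPipe exposes a devices attribute listing the device of each partition, as described in the project documentation:

import torch
from torch import nn
from torchgpipe import GPipe

model = GPipe(nn.Sequential(a, b, c, d), balance=[1, 1, 1, 1], chunks=8)
in_device = model.devices[0]    # device holding the first partition
out_device = model.devices[-1]  # device holding the last partition

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for input, target in data_loader:
    input = input.to(in_device, non_blocking=True)
    target = target.to(out_device, non_blocking=True)

    output = model(input)             # micro-batches flow through the pipeline
    loss = criterion(output, target)  # the output is produced on the last device
    optimizer.zero_grad()
    loss.backward()                   # gradients flow back through the pipeline
    optimizer.step()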

Documentation

Visit torchgpipe.readthedocs.io for more information including the API references.

Benchmarking

The full details and more benchmarks are available at torchgpipe.readthedocs.io.

ResNet-101 Accuracy Benchmark

Top-1 error rate (%) of ResNet-101 on ImageNet:

Batch size | torchgpipe | nn.DataParallel | Goyal et al.
256        | 21.99±0.13 | 22.02±0.11      | 22.08±0.06
1K         | 22.24±0.19 | 22.04±0.24      | N/A
4K         | 22.13±0.09 | N/A             | N/A

GPipe should be transparent, i.e., it should not require additional hyperparameter tuning. To verify this transparency, we reproduced the top-1 error rate of ResNet-101 on ImageNet, as reported in Table 2(c) of Accurate, Large Minibatch SGD by Goyal et al.

U-Net (B, C) Memory Benchmark

Experiment | U-Net (B, C) | Parameters | Memory usage
baseline   | (6, 72)      | 362.2M     | 20.3 GiB
pipeline-1 | (11, 128)    | 2.21B      | 20.5 GiB
pipeline-2 | (24, 128)    | 4.99B      | 43.4 GiB
pipeline-4 | (24, 160)    | 7.80B      | 79.1 GiB
pipeline-8 | (48, 160)    | 15.82B     | 154.1 GiB

The table shows how GPipe facilitates scaling U-Net models. baseline denotes the model trained without pipeline parallelism or checkpointing, and pipeline-1, -2, -4, -8 denote the model trained with GPipe using the corresponding number of partitions.

Here we used a simplified U-Net architecture. The size of the model is determined by the hyperparameters B and C, which are proportional to the number of layers and filters, respectively.

U-Net (5, 64) Speed Benchmark

Experiment | Throughput | Speed up
baseline   | 28.500/s   | 1×
pipeline-1 | 24.456/s   | 0.858×
pipeline-2 | 35.502/s   | 1.246×
pipeline-4 | 67.042/s   | 2.352×
pipeline-8 | 88.497/s   | 3.105×

To verify efficiency in the presence of skip connections, we measured the throughput of U-Net with various numbers of devices. We chose U-Net because it has several long skip connections.

AmoebaNet-D (18, 256) Speed Benchmark

Experiment | Throughput | Speed up (torchgpipe) | Speed up (Huang et al.)
n=2, m=1   | 26.733/s   | 1×                    | 1×
n=2, m=4   | 41.133/s   | 1.546×                | 1.07×
n=2, m=32  | 47.386/s   | 1.780×                | 1.21×
n=4, m=1   | 26.827/s   | 1.006×                | 1.13×
n=4, m=4   | 44.543/s   | 1.680×                | 1.26×
n=4, m=32  | 72.412/s   | 2.711×                | 1.84×
n=8, m=1   | 24.918/s   | 0.932×                | 1.38×
n=8, m=4   | 70.065/s   | 2.625×                | 1.72×
n=8, m=32  | 132.413/s  | 4.966×                | 3.48×

(n: number of partitions, m: number of micro-batches)

The table shows the reproduced speed benchmark on AmoebaNet-D (18, 256), as reported in Table 2 of the GPipe paper by Huang et al. Note that we replaced K in the paper with n.
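
For reference, n and m map onto GPipe's arguments roughly as follows. The amoebanet module and the uniform balance below are placeholders for illustration only, not the actual AmoebaNet-D partitioning used in this benchmark:

from torchgpipe import GPipe

n, m = 8, 32  # e.g. the last row of the table: 8 partitions, 32 micro-batches
# `amoebanet` is a placeholder nn.Sequential; len(balance) fixes the number of
# partitions (n) and `chunks` fixes the number of micro-batches (m).
model = GPipe(amoebanet, balance=[1] * n, chunks=m)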

Notes

This project is functional, but the interface is not finalized yet. All public APIs are subject to change without warning until v0.1.0.

Authors and Licensing

The torchgpipe project is developed by Heungsub Lee, Myungryong Jeong, and Chiheon Kim at Kakao Brain, with the help of Sungbin Lim, Ildoo Kim, Woonhyuk Baek, and Boogeon Yoon. It is distributed under the 3-clause BSD license.

Citation

If you apply this library to any project or research, please cite our code:

@article{kim2020torchgpipe,
    title={torchgpipe: On-the-fly Pipeline Parallelism for Training Giant Models},
    author={Chiheon Kim and Heungsub Lee and Myungryong Jeong and Woonhyuk Baek and Boogeon Yoon and Ildoo Kim and Sungbin Lim and Sungwoong Kim},
    year={2020},
    eprint={2004.09910},
    archivePrefix={arXiv}
}

More Repositories

1. fast-autoaugment - Official Implementation of 'Fast AutoAugment' in PyTorch. (Python, 1,587 stars)
2. nerf-factory - An awesome PyTorch NeRF library (Python, 1,265 stars)
3. pororo - PORORO: Platform Of neuRal mOdels for natuRal language prOcessing (Python, 1,252 stars)
4. coyo-dataset - COYO-700M: Large-scale Image-Text Pair Dataset (Python, 1,062 stars)
5. kogpt - KakaoBrain KoGPT (Korean Generative Pre-trained Transformer) (Python, 1,000 stars)
6. karlo (Python, 679 stars)
7. rq-vae-transformer - The official implementation of Autoregressive Image Generation using Residual Quantization (CVPR '22) (Jupyter Notebook, 669 stars)
8. mindall-e - PyTorch implementation of a 1.3B text-to-image generation model trained on 14 million image-text pairs (Python, 630 stars)
9. word2word - Easy-to-use word-to-word translations for 3,564 language pairs. (Python, 350 stars)
10. torchlars - A LARS implementation in PyTorch (Python, 326 stars)
11. g2pm - A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset (Python, 326 stars)
12. kor-nlu-datasets - KorNLI and KorSTS: New Benchmark Datasets for Korean Natural Language Understanding (283 stars)
13. trident - A performance library for machine learning applications. (Python, 176 stars)
14. autoclint - A specially designed light version of Fast AutoAugment (Python, 170 stars)
15. sparse-detr - PyTorch Implementation of Sparse DETR (Python, 150 stars)
16. hotr - Official repository for HOTR: End-to-End Human-Object Interaction Detection with Transformers (CVPR'21, Oral Presentation) (Python, 132 stars)
17. kortok - The code and models for "An Empirical Study of Tokenization Strategies for Various Korean NLP Tasks" (AACL-IJCNLP 2020) (Python, 114 stars)
18. bassl (Python, 113 stars)
19. scrl - PyTorch Implementation of Spatially Consistent Representation Learning (SCRL) (Python, 108 stars)
20. flame - Official implementation of the paper "FLAME: Free-form Language-based Motion Synthesis & Editing" (Python, 103 stars)
21. brain-agent - Brain Agent for Large-Scale and Multi-Task Agent Learning (Python, 92 stars)
22. helo-word - Team Kakao&Brain's Grammatical Error Correction System for the ACL 2019 BEA Shared Task (Python, 88 stars)
23. jejueo - Jejueo Datasets for Machine Translation and Speech Synthesis (Python, 74 stars)
24. solvent (Python, 66 stars)
25. noc (Jupyter Notebook, 44 stars)
26. cxr-clip (Python, 43 stars)
27. expgan (Python, 41 stars)
28. autowu - Official repository for Automated Learning Rate Scheduler for Large-Batch Training (8th ICML Workshop on AutoML) (Python, 39 stars)
29. nvs-adapter (Python, 33 stars)
30. ginr-ipc - The official implementation of Generalizable Implicit Neural Representations with Instance Pattern Composers (CVPR'23 highlight). (Python, 30 stars)
31. coyo-vit - ViT trained on COYO-Labeled-300M dataset (Python, 28 stars)
32. irm-empirical-study - An Empirical Study of Invariant Risk Minimization (Python, 28 stars)
33. coyo-align - ALIGN trained on COYO-dataset (Python, 25 stars)
34. magvlt - The official implementation of MAGVLT: Masked Generative Vision-and-Language Transformer (CVPR'23) (Python, 23 stars)
35. hqtransformer - Locally Hierarchical Auto-Regressive Modeling for Image Generation (HQ-Transformer) (Jupyter Notebook, 21 stars)
36. CheXGPT (Python, 18 stars)
37. learning-loss-for-tta - "Learning Loss for Test-Time Augmentation (NeurIPS 2020)" (Python, 9 stars)
38. stg - Official implementation of Selective Token Generation (COLING'22) (Jupyter Notebook, 8 stars)
39. leco - Official implementation of LECO (NeurIPS'22) (Python, 6 stars)
40. bc-hyperopt-example - Brain Cloud hyperopt example (MNIST) (Python, 3 stars)