• Stars
    star
    201
  • Rank 194,491 (Top 4 %)
  • Language
    Python
  • License
    MIT License
  • Created over 3 years ago
  • Updated almost 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training

ActNN : Activation Compressed Training

This is the official project repository for ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training by Jianfei Chen*, Lianmin Zheng*, Zhewei Yao, Dequan Wang, Ion Stoica, Michael W. Mahoney, and Joseph E. Gonzalez.

TL; DR. ActNN is a PyTorch library for memory-efficient training. It reduces the training memory footprint by compressing the saved activations. ActNN is implemented as a collection of memory-saving layers. These layers have an identical interface to their PyTorch counterparts.

Abstract

The increasing size of neural network models has been critical for improvements in their accuracy, but device memory is not growing at the same rate. This creates fundamental challenges for training neural networks within limited memory environments. In this work, we propose ActNN, a memory-efficient training framework that stores randomly quantized activations for back propagation. We prove the convergence of ActNN for general network architectures, and we characterize the impact of quantization on the convergence via an exact expression for the gradient variance. Using our theory, we propose novel mixed-precision quantization strategies that exploit the activation's heterogeneity across feature dimensions, samples, and layers. These techniques can be readily applied to existing dynamic graph frameworks, such as PyTorch, simply by substituting the layers. We evaluate ActNN on mainstream computer vision models for classification, detection, and segmentation tasks. On all these tasks, ActNN compresses the activation to 2 bits on average, with negligible accuracy loss. ActNN reduces the memory footprint of the activation by 12Γ—, and it enables training with a 6.6Γ— to 14Γ— larger batch size.

mem_speed_r50 Batch size vs. training throughput on ResNet-50. Red cross mark means out-of-memory. The shaded yellow region denotes the possible batch sizes with full precision training. ActNN achieves significantly larger maximum batch size over other state-of-the-art systems and displays a nontrivial trade-off curve.

Install

  • Requirements
torch>=1.7.1,<=1.8.0
torchvision>=0.8.2
ninja>=1.10.1

GPU and CUDA Toolkit are required.

  • Build
git clone [email protected]:ucbrise/actnn.git
cd actnn/actnn
pip install -v -e .

Usage

mem_speed_benchmark/train.py is an example on using ActNN for models from torchvision.

Basic Usage

  • Step1: Configure the optimization level
    ActNN provides several optimization levels to control the trade-off between memory saving and computational overhead. You can set the optimization level by
import actnn
# available choices are ["L0", "L1", "L2", "L3", "L4", "L5"]
actnn.set_optimization_level("L3")

See set_optimization_level for more details.

  • Step2: Convert the model to use ActNN's layers.
model = actnn.QModule(model)

Note:

  1. Convert the model before calling .cuda().
  2. Set the optimization level before invoking actnn.QModule or constructing any ActNN layers.
  3. Automatic model conversion only works with standard PyTorch layers. Please use the modules (nn.Conv2d, nn.ReLU, etc.), not the functions (F.conv2d, F.relu).
  • Step3: Print the model to confirm that all the modules (Conv2d, ReLU, BatchNorm) are correctly converted to ActNN layers.
print(model)    # Should be actnn.QConv2d, actnn.QBatchNorm2d, etc.

Advanced Features

  • Convert the model manually.
    ActNN is implemented as a collection of memory-saving layers, including actnn.QConv1d, QConv2d, QConv3d, QConvTranspose1d, QConvTranspose2d, QConvTranspose3d, QBatchNorm1d, QBatchNorm2d, QBatchNorm3d, QLinear, QReLU, QSyncBatchNorm, QMaxPool2d. These layers have identical interface to their PyTorch counterparts. You can construct the model manually using these layers as the building blocks. See ResNetBuilder and resnet_configs in image_classification/image_classification/resnet.py for example.
  • (Optional) Change the data loader
    If you want to use per-sample gradient information for adaptive quantization, you have to update the dataloader to return sample indices. You can see train_loader in mem_speed_benchmark/train.py for example. In addition, you have to update the configurations.
from actnn import config, QScheme
config.use_gradient = True
QScheme.num_samples = 1300000   # the size of training set

You can find sample code in the above script.

Examples

Benchmark Memory Usage and Training Speed

See mem_speed_benchmark. Please do NOT measure the memory usage by nvidia-smi. nvidia-smi reports the size of the memory pool allocated by PyTorch, which can be much larger than the size of acutal used memory.

Image Classification

See image_classification

Object Detection, Semantic Segmentation, Self-Supervised Learning, ...

Here is the example memory-efficient training for ResNet50, built upon the OpenMMLab toolkits. We use ActNN with the default optimization level (L3). Our training runs are available at Weights & Biases.

Installation

  1. Install mmcv
export MMCV_ROOT=/path/to/clone/actnn-mmcv
git clone https://github.com/DequanWang/actnn-mmcv $MMCV_ROOT
cd $MMCV_ROOT
MMCV_WITH_OPS=1 MMCV_WITH_ORT=0 pip install -e .
  1. Install mmdet, mmseg, mmssl, ...
export MMDET_ROOT=/path/to/clone/actnn-mmdet
git clone https://github.com/DequanWang/actnn-mmdet $MMDET_ROOT
cd $MMDET_ROOT
python setup.py develop
export MMSEG_ROOT=/path/to/clone/actnn-mmseg
git clone https://github.com/DequanWang/actnn-mmseg $MMSEG_ROOT
cd $MMSEG_ROOT
python setup.py develop
export MMSSL_ROOT=/path/to/clone/actnn-mmssl
git clone https://github.com/DequanWang/actnn-mmssl $MMSSL_ROOT
cd $MMSSL_ROOT
python setup.py develop

Single GPU training

cd $MMDET_ROOT
python tools/train.py configs/actnn/faster_rcnn_r50_fpn_1x_coco_1gpu.py
# https://wandb.ai/actnn/detection/runs/ye0aax5s
# ActNN mAP 37.4 vs Official mAP 37.4
python tools/train.py configs/actnn/retinanet_r50_fpn_1x_coco_1gpu.py
# https://wandb.ai/actnn/detection/runs/1x9cwokw
# ActNN mAP 36.3 vs Official mAP 36.5
cd $MMSEG_ROOT
python tools/train.py configs/actnn/fcn_r50-d8_512x1024_80k_cityscapes_1gpu.py
# https://wandb.ai/actnn/segmentation/runs/159if8da
# ActNN mIoU 72.9 vs Official mIoU 73.6
python tools/train.py configs/actnn/fpn_r50_512x1024_80k_cityscapes_1gpu.py
# https://wandb.ai/actnn/segmentation/runs/25j9iyv3
# ActNN mIoU 74.7 vs Official mIoU 74.5

Multiple GPUs training

cd $MMSSL_ROOT
bash tools/dist_train.sh configs/selfsup/actnn/moco_r50_v2_bs512_e200_imagenet_2gpu.py 2
# https://wandb.ai/actnn/mmssl/runs/lokf7ydo
# https://wandb.ai/actnn/mmssl/runs/2efmbuww
# ActNN top1 67.3 vs Official top1 67.7

For more detailed guidance, please refer to the docs of mmcv, mmdet, mmseg, mmssl.

FAQ

  1. Does ActNN supports CPU training?
    Currently, ActNN only supports CUDA.

  2. Accuracy degradation / diverged training with ActNN.
    ActNN applies lossy compression to the activations. In some challenging cases, our default compression strategy might be too aggressive. In this case, you may try more conservative compression strategies (which consume more memory):

    • 4-bit per-group quantization
    actnn.set_optimization_level("L2")
    • 8-bit per-group quantization
    actnn.set_optimization_level("L2")
    actnn.config.activation_compression_bits = [8]

    If none of these works, you may report to us by creating an issue.

Citation

If the actnn library is helpful in your research, please consider citing our paper:

@inproceedings{chen2021actnn,
  title={ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training},
  author={Chen, Jianfei and Zheng, Lianmin and Yao, Zhewei and Wang, Dequan and Stoica, Ion and Mahoney, Michael W and Gonzalez, Joseph E},
  booktitle={International Conference on Machine Learning},
  year={2021}
}

More Repositories

1

confluo

Real-time Monitoring and Analysis of Data Streams
C++
1,442
star
2

clipper

A low-latency prediction-serving system
C++
1,399
star
3

anna

Java
448
star
4

cs294-ai-sys-sp19

CS294; AI For Systems and Systems For AI
HTML
220
star
5

flor

🌻 Flow with FlorDB
Jupyter Notebook
151
star
6

graphtrans

Representing Long-Range Context for Graph Neural Networks with Global Attention
Python
122
star
7

cirrus

Serverless ML Framework
C++
107
star
8

piranha

Piranha: A GPU Platform for Secure Computation
C++
88
star
9

cpp_project_template

Minimal C++ Project Template
C++
38
star
10

risecamp

RISECamp Tutorials
Jupyter Notebook
38
star
11

jedi-pairing

Go, C++, and C implementation of a bilinear group and pairing-based cryptography for both embedded and non-embedded systems
C++
34
star
12

dory

Go
32
star
13

cs294-rise-fa16

CS294 RISE Course Material
32
star
14

cs294-ai-sys-fa19

CS294-162; Machine Learning Systems Seminar
HTML
31
star
15

hypersched

Deadline-based hyperparameter tuning on RayTune.
Python
31
star
16

LatticeFlow

C++
28
star
17

mage

MAGE: Memory-Aware Garbling Engine
C++
25
star
18

aws-audit

aws consolidated billing audit/reporting tool
Python
24
star
19

waldo

C++
23
star
20

snoopy

A high-throughput oblivious storage system
C
23
star
21

MerkleSquare

A Go library for MerkleSquare: A Low-Latency Transparency Log System
Go
20
star
22

clipper-tutorials

Jupyter Notebook
17
star
23

fluent-old

Bloom + C++
JavaScript
17
star
24

cs294-ai-sys-sp22

CS294 AI Systems Class Website
SCSS
15
star
25

jarvis

Build, configure, and track workflows with Jarvis.
Python
13
star
26

caravel

Studying GPU Multi-tenancy
Jupyter Notebook
12
star
27

jedi-protocol-go

Golang implementation of JEDI: Many-to-Many End-to-End Encryption and Key Delegation for IoT
Go
12
star
28

costco

Python
10
star
29

cs262a-fall2020

CS 262a Fall 2020
HTML
6
star
30

tcplp

Performant TCP for Low-Power Wireless Networks
5
star
31

mage-scripts

Benchmarking scripts for MAGE
Python
5
star
32

clipper-website

Hugo sources for clipper.ai website
CSS
4
star
33

kaggle-nlp-disasters

https://www.kaggle.com/c/nlp-getting-started
Jupyter Notebook
3
star
34

clipper-serving-testbed

Python
2
star
35

fluent_scala

Scala
2
star
36

netplay

C++
1
star
37

buddy

An Python/SQL DSL for CALM cloud programming
Python
1
star
38

flor-camp2018

Jupyter Notebook
1
star
39

maceta

Delta Pruning: pruning spurious/deleterious changes that arise organically from iterative model development
1
star