• Stars: 135
• Rank: 261,925 (Top 6%)
• Language: Python
• License: MIT License
• Created: about 5 years ago
• Updated: 11 months ago

Repository Details

Practical low-rank gradient compression for distributed optimization: https://arxiv.org/abs/1905.13727

PowerSGD

Practical Low-Rank Gradient Compression for Distributed Optimization

Video

Abstract: We study gradient compression methods to alleviate the communication bottleneck in data-parallel distributed optimization. Despite the significant attention received, current compression schemes either do not scale well or fail to achieve the target test accuracy. We propose a new low-rank gradient compressor based on power iteration that can i) compress gradients rapidly, ii) efficiently aggregate the compressed gradients using all-reduce, and iii) achieve test performance on par with SGD. The proposed algorithm is the only method evaluated that achieves consistent wall-clock speedups when benchmarked against regular SGD with an optimized communication backend. We demonstrate reduced training times for convolutional networks as well as LSTMs on common datasets.
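
To make the compression scheme concrete, here is a minimal, hedged sketch of the core idea behind the abstract: a single power-iteration step factors a gradient matrix into rank-r matrices that are cheap to all-reduce. The function name, shapes, and rank below are illustrative and do not mirror the reference implementation's internals.

# Illustrative sketch of rank-r power-iteration compression (not the reference code).
import torch

def power_iteration_compress(M: torch.Tensor, Q: torch.Tensor):
    """One PowerSGD-style step: M is an (n, m) gradient matrix, Q an (m, r)
    factor carried over from the previous step (warm start)."""
    P = M @ Q                  # (n, r); in data-parallel training, P is all-reduced here
    P, _ = torch.linalg.qr(P)  # orthogonalize the columns of P
    Q = M.t() @ P              # (m, r); Q is all-reduced as well
    M_hat = P @ Q.t()          # rank-r approximation used in place of the raw gradient
    return M_hat, Q

# Example: compress a 256 x 512 "gradient" at rank 2.
M = torch.randn(256, 512)
Q = torch.randn(512, 2)
M_hat, Q = power_iteration_compress(M, Q)
# The compression error (M - M_hat) is kept locally and added to the next
# gradient (error feedback), which is what lets such aggressive compression
# still reach SGD-level test accuracy.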

Reference implementation

This is a reference implementation for the PowerSGD algorithm.

Installation:

pip install git+https://github.com/epfml/powersgd.git

Usage:

+ from powersgd import PowerSGD, Config, optimizer_step

  model = torchvision.models.resnet50(pretrained=True)
  params = list(model.parameters())
  optimizer = torch.optim.SGD(params, lr=0.1)

+ powersgd = PowerSGD(params, config=Config(
+     rank=1,  # lower rank => more aggressive compression
+     min_compression_rate=10,  # only compress gradients that would compress by at least this factor
+     num_iters_per_step=2,  # lower number => more aggressive compression
+     start_compressing_after_num_steps=0,
+ ))

  for each batch:
      loss = ...
-     optimizer.zero_grad()
      loss.backward()
-     optimizer.step()
+     optimizer_step(optimizer, powersgd)

PyTorch implementation

PyTorch features an implementation of PowerSGD as a communication hook for DistributedDataParallel (DDP) models. Because of its integration with DDP, that implementation is more involved than the code in this repository.
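
For orientation, registering that built-in hook on a DDP model looks roughly like the sketch below. It is based on the torch.distributed.algorithms.ddp_comm_hooks.powerSGD_hook module; argument defaults can differ between PyTorch versions, and the Linear model, "gloo" backend, and chosen values are placeholders rather than recommendations.

# Sketch of PyTorch's built-in PowerSGD DDP communication hook (placeholder setup).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook as powerSGD

# Assumes the script is launched with torchrun so the process-group
# environment variables (RANK, WORLD_SIZE, ...) are set.
dist.init_process_group("gloo")
model = DDP(torch.nn.Linear(1000, 1000))

state = powerSGD.PowerSGDState(
    process_group=None,            # use the default process group
    matrix_approximation_rank=1,   # analogous to `rank` in this repository's Config
    start_powerSGD_iter=10,        # plain all-reduce for the first iterations
)
model.register_comm_hook(state, powerSGD.powerSGD_hook)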

Research code

Research code for the experiments in the PowerSGD paper is located under paper-code.

Selected follow-up work

  • (Cho et al., 2019) concurrently developed an algorithm that is fundamentally very similar to PowerSGD.
  • (Ramesh et al., 2021 - DALL-E) share valuable recommendations on using PowerSGD for large-scale transformer training.
  • (Agarwal et al., 2020) share insights into adaptive compression with PowerSGD.
  • (Vogels et al., 2020) adapt PowerSGD to work in a decentralized setting (with sparse connectivity between workers).
  • (Wang, 2021) introduces a variation of PowerSGD and describes experience applying it to large language models.
  • (Please submit a PR if you want your work to be included here.)

Reference

If you use this code, please cite the following paper:

@inproceedings{vkj2019powersgd,
  author = {Vogels, Thijs and Karimireddy, Sai Praneeth and Jaggi, Martin},
  title = "{{PowerSGD}: Practical Low-Rank Gradient Compression for Distributed Optimization}",
  booktitle = {NeurIPS 2019 - Advances in Neural Information Processing Systems},
  year = 2019,
  url = {https://arxiv.org/abs/1905.13727}
}

More Repositories

1. sent2vec (C++, 1,187 stars): General purpose unsupervised sentence representations
2. ML_course (Jupyter Notebook, 1,169 stars): EPFL Machine Learning Course, Fall 2023
3. attention-cnn (Python, 1,062 stars): Source code for "On the Relationship between Self-Attention and Convolutional Layers"
4. OptML_course (Jupyter Notebook, 1,049 stars): EPFL Course - Optimization for Machine Learning - CS-439
5. landmark-attention (Python, 258 stars): Landmark Attention: Random-Access Infinite Context Length for Transformers
6. federated-learning-public-code (Python, 154 stars)
7. collaborative-attention (Python, 146 stars): Code for Multi-Head Attention: Collaborate Instead of Concatenate
8. disco (TypeScript, 126 stars): Decentralized & federated privacy-preserving ML training, using p2p networking, in JS
9. dynamic-sparse-flash-attention (Jupyter Notebook, 112 stars)
10. DenseFormer (Python, 68 stars)
11. ChocoSGD (Python, 59 stars): Decentralized SGD and Consensus with Communication Compression: https://arxiv.org/abs/1907.09356
12. llm-baselines (Python, 55 stars)
13. sparsifiedSGD (Jupyter Notebook, 54 stars): Sparsified SGD with Memory: https://arxiv.org/abs/1809.07599
14. optML-pku (42 stars): summer school materials
15. LocalSGD-Code (Python, 41 stars)
16. error-feedback-SGD (Jupyter Notebook, 28 stars): SGD with compressed gradients and error-feedback: https://arxiv.org/abs/1901.09847
17. Bi-Sent2Vec (C++, 20 stars): Robust Cross-lingual Embeddings from Parallel Sentences
18. byzantine-robust-optimizer (Jupyter Notebook, 20 stars): Learning from history for Byzantine Robustness
19. opt-summerschool (Jupyter Notebook, 20 stars): Short Course on Optimization for Machine Learning - Slides and Practical Labs - DS3 Data Science Summer School, June 24 to 28, 2019, Paris, France
20. interpret-lm-knowledge (Jupyter Notebook, 20 stars): Extracting knowledge graphs from language models as a diagnostic benchmark of model performance (NeurIPS XAI 2021)
21. cola (Python, 18 stars): CoLa - Decentralized Linear Learning: https://arxiv.org/abs/1808.04883
22. opt-shortcourse (Jupyter Notebook, 18 stars): Short Course on Optimization for Machine Learning - Slides and Practical Lab - Pre-doc Summer School on Learning Systems, July 3 to 7, 2017, Zürich, Switzerland
23. powergossip (Python, 14 stars): Code for "Practical Low-Rank Communication Compression in Decentralized Deep Learning"
24. byzantine-robust-noniid-optimizer (Python, 14 stars)
25. X2Static (Python, 12 stars): X2Static embeddings
26. kubernetes-setup (Dockerfile, 12 stars): MLO group setup for kubernetes cluster
27. relaysgd (Jupyter Notebook, 10 stars): Code for the paper "RelaySum for Decentralized Deep Learning on Heterogeneous Data"
28. topology-in-decentralized-learning (Python, 10 stars): Code related to "Beyond spectral gap: The role of the topology in decentralized learning"
29. quasi-global-momentum (Python, 9 stars)
30. piecewise-affine-multiplication (Python, 7 stars)
31. rotational-optimizers (Python, 6 stars)
32. byzantine-robust-decentralized-optimizer (Jupyter Notebook, 6 stars)
33. uncertainity-estimation (Jupyter Notebook, 6 stars): Code for the paper "The Peril of Popular Deep Learning Uncertainty Estimation Methods"
34. getting-started (Python, 6 stars)
35. text_to_image_generation (Python, 5 stars)
36. easy-summary (Python, 5 stars): difficulty-guided text summarization
37. FeAI (Vue, 4 stars): Federated Learning with TensorFlow.js
38. autoTrain (Python, 3 stars): Open Challenge - Automatic Training for Deep Learning
39. ghost-noise (Python, 3 stars)
40. pax (Python, 3 stars): JAX-like API for PyTorch
41. personalized-collaborative-llms (Python, 2 stars)
42. phantomedicus (Jupyter Notebook, 1 star): MedSurge: medical survey generator