Distributed PyTorch Lightning Training on Ray

This library adds new PyTorch Lightning strategies for distributed training using the Ray distributed computing framework.

These PyTorch Lightning strategies on Ray enable quick and easy parallel training while still leveraging all the benefits of PyTorch Lightning and using your desired training protocol, either PyTorch Distributed Data Parallel or Horovod.

Once you add your strategy to the PyTorch Lightning Trainer, you can parallelize training to all the cores in your laptop, or across a massive multi-node, multi-GPU cluster with no additional code changes.

This library also comes with an integration with Ray Tune for distributed hyperparameter tuning experiments.

Table of Contents

  1. Installation
  2. PyTorch Lightning Compatibility
  3. PyTorch Distributed Data Parallel Strategy on Ray
  4. Multi-Node Distributed Training
  5. Multi-Node Training from your Laptop
  6. Horovod Strategy on Ray
  7. Model Parallel Sharded Training on Ray
  8. Hyperparameter Tuning with Ray Tune
  9. FAQ

Installation

You can install Ray Lightning via pip:

pip install ray_lightning

Or to install master:

pip install git+https://github.com/ray-project/ray_lightning#ray_lightning

PyTorch Lightning Compatibility

Here are the supported PyTorch Lightning versions:

Ray Lightning    PyTorch Lightning
0.1              1.4
0.2              1.5
0.3              1.6
master           1.6
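
For example, to install a matching pair from the table above, you can pin both packages together. This is a hedged sketch: the package names follow the Installation section, and the exact patch versions are up to you.

pip install "ray_lightning==0.3" "pytorch_lightning==1.6.*"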

PyTorch Distributed Data Parallel Strategy on Ray

The RayStrategy provides Distributed Data Parallel training on a Ray cluster. PyTorch DDP is used as the distributed training protocol, and Ray is used to launch and manage the training worker processes.

Here is a simplified example:

import pytorch_lightning as pl
from ray_lightning import RayStrategy

# Create your PyTorch Lightning model here.
ptl_model = MNISTClassifier(...)
strategy = RayStrategy(num_workers=4, num_cpus_per_worker=1, use_gpu=True)

# Don't set ``gpus`` in the ``Trainer``.
# The actual number of GPUs is determined by ``num_workers``.
trainer = pl.Trainer(..., strategy=strategy)
trainer.fit(ptl_model)

Because Ray is used to launch the worker processes, rather than the same script being invoked multiple times, you can use this strategy even in cases where the standard DDPStrategy cannot be used, such as:

  • Jupyter Notebooks, Google Colab, Kaggle
  • Calling fit or test multiple times in the same script
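
For example, here is a minimal sketch of the second case, calling fit and test multiple times in one script. It reuses the MNISTClassifier placeholder from the example above.

import pytorch_lightning as pl
from ray_lightning import RayStrategy

# Train and evaluate several runs in the same Python process;
# each Trainer launches its own set of Ray worker processes.
for seed in (0, 1):
    pl.seed_everything(seed)
    model = MNISTClassifier(...)  # placeholder model from the example above
    trainer = pl.Trainer(
        max_epochs=1,
        strategy=RayStrategy(num_workers=4, num_cpus_per_worker=1, use_gpu=False))
    trainer.fit(model)
    trainer.test(model)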

Multi-Node Distributed Training

Using the same examples as above, you can run distributed training on a multi-node cluster in just a couple of simple steps.

First, use Ray's Cluster launcher to start a Ray cluster:

ray up my_cluster_config.yaml

Then, run your Ray script using one of the following options:

  1. on the head node of the cluster (python train_script.py)
  2. via ray job submit (docs) from your laptop (ray job submit -- python train.py)
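
For option 1, here is a minimal sketch of what train_script.py might contain. It assumes the MNISTClassifier model from the earlier example is importable, and it connects explicitly to the cluster that ray up started before creating the Trainer.

# train_script.py
import ray
import pytorch_lightning as pl
from ray_lightning import RayStrategy

# Connect to the Ray cluster already running on this head node.
ray.init(address="auto")

ptl_model = MNISTClassifier(...)
trainer = pl.Trainer(
    max_epochs=4,
    strategy=RayStrategy(num_workers=8, num_cpus_per_worker=1, use_gpu=True))
trainer.fit(ptl_model)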

Multi-Node Training from your Laptop

Ray provides the capability to run multi-node and GPU training entirely from your laptop through Ray Client (https://docs.ray.io/en/master/cluster/ray-client.html).

First, use Ray's Cluster Launcher to set up the cluster. Then, add this line to the beginning of your script to connect to the cluster:

import ray
# replace with the appropriate host and port
ray.init("ray://<head_node_host>:10001")

Now you can run your training script on the laptop, but have it execute as if your laptop had all the resources of the cluster, essentially providing you with an infinite laptop.

Note: When using Ray Client, you must disable checkpointing and logging for your Trainer by setting checkpoint_callback and logger to False.
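
Here is a minimal sketch of a Trainer configured for Ray Client, following the note above. The host and port are placeholders, and MNISTClassifier is the placeholder model from the earlier examples.

import ray
import pytorch_lightning as pl
from ray_lightning import RayStrategy

# Connect to the remote cluster from your laptop via Ray Client.
ray.init("ray://<head_node_host>:10001")

ptl_model = MNISTClassifier(...)
trainer = pl.Trainer(
    strategy=RayStrategy(num_workers=4, use_gpu=True),
    # Both must be disabled when driving training through Ray Client.
    checkpoint_callback=False,
    logger=False)
trainer.fit(ptl_model)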

Horovod Strategy on Ray

If you prefer to use Horovod as the distributed training protocol, use the HorovodRayStrategy instead.

import pytorch_lightning as pl
from ray_lightning import HorovodRayStrategy

# Create your PyTorch Lightning model here.
ptl_model = MNISTClassifier(...)

# 2 workers, 1 CPU and 1 GPU each.
strategy = HorovodRayStrategy(num_workers=2, use_gpu=True)

# Don't set ``gpus`` in the ``Trainer``.
# The actual number of GPUs is determined by ``num_workers``.
trainer = pl.Trainer(..., strategy=strategy)
trainer.fit(ptl_model)

Model Parallel Sharded Training on Ray

The RayShardedStrategy integrates with FairScale to provide sharded DDP training on a Ray cluster. With sharded training, you can leverage the scalability of data-parallel training while drastically reducing memory usage when training large models.

import pytorch_lightning as pl
from ray_lightning import RayShardedStrategy

# Create your PyTorch Lightning model here.
ptl_model = MNISTClassifier(...)
strategy = RayShardedStrategy(num_workers=4, num_cpus_per_worker=1, use_gpu=True)

# Don't set ``gpus`` in the ``Trainer``.
# The actual number of GPUs is determined by ``num_workers``.
trainer = pl.Trainer(..., strategy=strategy)
trainer.fit(ptl_model)

See the PyTorch Lightning docs for more information on sharded training.

Hyperparameter Tuning with Ray Tune

ray_lightning also integrates with Ray Tune to provide distributed hyperparameter tuning for your distributed model training. You can run multiple PyTorch Lightning training runs in parallel, each with a different hyperparameter configuration, and each individual run parallelized on its own. All you have to do is move your training code into a function, pass the function to tune.run, and make sure to add the appropriate callback (either TuneReportCallback or TuneReportCheckpointCallback) to your PyTorch Lightning Trainer.

Example using ray_lightning with Tune:

from ray import tune

from ray_lightning import RayStrategy
from ray_lightning.examples.ray_ddp_example import MNISTClassifier
from ray_lightning.tune import TuneReportCallback, get_tune_resources

import pytorch_lightning as pl


def train_mnist(config):
    
    # Create your PTL model.
    model = MNISTClassifier(config)

    # Create the Tune Reporting Callback
    metrics = {"loss": "ptl/val_loss", "acc": "ptl/val_accuracy"}
    callbacks = [TuneReportCallback(metrics, on="validation_end")]
    
    trainer = pl.Trainer(
        max_epochs=4,
        callbacks=callbacks,
        strategy=RayStrategy(num_workers=4, use_gpu=False))
    trainer.fit(model)
    
config = {
    "layer_1": tune.choice([32, 64, 128]),
    "layer_2": tune.choice([64, 128, 256]),
    "lr": tune.loguniform(1e-4, 1e-1),
    "batch_size": tune.choice([32, 64, 128]),
}

# Make sure to pass in ``resources_per_trial`` using the ``get_tune_resources`` utility.
analysis = tune.run(
    train_mnist,
    metric="loss",
    mode="min",
    config=config,
    num_samples=2,
    resources_per_trial=get_tune_resources(num_workers=4),
    name="tune_mnist")

print("Best hyperparameters found were: ", analysis.best_config)

Note: Ray Tune requires 1 additional CPU per trial for the Trainable driver, so the actual number of CPUs each trial requires is num_workers * num_cpus_per_worker + 1.
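
For the example above (num_workers=4 with the default of 1 CPU per worker), the arithmetic works out as follows:

# 4 workers * 1 CPU each + 1 CPU for the Trainable driver = 5 CPUs per trial.
resources_per_trial = get_tune_resources(num_workers=4)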

FAQ

I see that RayStrategy is based on PyTorch Lightning's DDPSpawnStrategy. However, doesn't the PTL team discourage the use of spawn?

As discussed here, using a spawn approach instead of a launch approach is not all that detrimental. The original factors for discouraging spawn were:

  1. not being able to use 'spawn' in a Jupyter or Colab notebook, and
  2. not being able to use multiple workers for data loading.

Neither of these should be an issue with the RayStrategy due to Ray's serialization mechanisms. The only thing to keep in mind is that when using this strategy, your model does have to be serializable/pickleable.
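
As a quick sanity check (a hedged sketch reusing the MNISTClassifier placeholder from the examples above), you can confirm the model serializes with Ray's pickler before handing it to the strategy:

from ray import cloudpickle

# Ray ships objects to its workers with cloudpickle; if this call raises,
# the model will also fail when used with RayStrategy.
cloudpickle.dumps(MNISTClassifier(...))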

Horovod installation issues? Please see the details here.
