  • Stars: 195
  • Rank: 199,374 (top 4%)
  • Language: Python
  • License: MIT License
  • Created: almost 9 years ago
  • Updated: about 7 years ago


Repository Details

platoon

Experimental multi-GPU mini-framework for Theano

It supports data-parallelism inside one compute node, not model-parallelism. For model-parallelism, see the Theano multiple-GPUs tutorial.

In Platoon, there are two main components: workers and controllers. Workers do the bulk of the work (training, monitoring, ...). Controllers interact with multiple workers to coordinate their work, collect the results, and decide how to act on them. To use Platoon, you will need to write code which uses a worker. You can also extend the functionality of a worker or a controller by implementing your own; Platoon provides helper classes to facilitate this.

This framework is under development. Its interface is not polished and it is likely to undergo changes in the future.

The framework provides two separate worker interfaces that allow users to implement multiple data-parallel algorithms: param_sync and all_reduce. The default interface is param_sync. Installing the optional dependencies listed in the table below makes the all_reduce interface available too.

Interface  | sync type  | multi-node                   | Theano Ops | extra dependencies
-----------|------------|------------------------------|------------|--------------------
param_sync | sync/async | no                           | no         | none
all_reduce | sync only  | yes (if mpi4py is installed) | yes        | NCCL, pygpu, Theano
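Since all_reduce is only usable when its optional dependencies are installed, a script can probe for them before choosing an interface. This is a minimal sketch; the import-based check is an assumption for illustration, not a mechanism Platoon itself provides:

```python
def available_interfaces():
    """Return the Platoon worker interfaces this environment can support.

    param_sync has no extra dependencies; all_reduce needs pygpu (among
    others, per the table above), so we use its presence as a cheap proxy.
    """
    interfaces = ['param_sync']  # always available
    try:
        import pygpu  # noqa: F401  -- optional dependency of all_reduce
        interfaces.append('all_reduce')
    except ImportError:
        pass
    return interfaces

print(available_interfaces())
```

A real check would also verify NCCL and, for multi-node runs, mpi4py.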

There are currently two algorithms for distributed gradient descent implemented with the param_sync interface and three with the all_reduce interface.

  • param_sync: EASGD and ASGD.
  • all_reduce: synchronous sum/average SGD, EASGD, and a synchronous variant of Downpour.

There are working examples in the examples directory.

The steps below describe what needs to be done to use Platoon for data-parallelism. The LSTM example in the example folder was implemented following these steps and should be referred to for guidance.

Install

You can simply install it using pip:

pip install git+https://github.com/mila-udem/platoon

If you would like to use the examples or help develop Platoon, first clone the repo.

git clone https://github.com/mila-udem/platoon

Then install what you just cloned.

pip install -e <path-to-platoon-folder>

Usage

The simplest way to launch a multi-GPU experiment is to first implement a controller and a worker as described below, then launch them using platoon-launcher. You do not need a controller file if the existing controller functionality is enough for you.

The launcher assumes that both files are named as follows: <experiment-name>_controller.py and <experiment-name>_worker.py.

Then, to launch the experiment, you just need to specify the experiment name and the GPUs you want to use:

platoon-launcher <experiment-name> -D gpu0 gpu1

You can also omit the -D argument and let the launcher find all available CUDA GPUs to use in a single-node experiment:

platoon-launcher <experiment-name>

For more configuration options, see platoon-launcher -h.

Implementing a controller

These steps describe how to implement the Python script that will launch your controller. In the included LSTM example, both of these steps are done in the file lstm_controller.py.

  1. Define which commands your controller can receive and how it responds to them. Commands starting with "platoon-" are reserved by Platoon.

This is done by creating a new class that inherits from channel.Controller and overrides the method handle_control(), which will be called whenever your controller receives a request from a worker.

  2. Instantiate and launch your custom controller.

Create a script that instantiates your custom controller. Once this is done, define the port on which the controller should listen by calling the function init_control. Finally, call your controller's serve method, which makes it ready to receive requests from workers.
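The two steps above can be sketched as follows. This is a hedged illustration of the control protocol only: the base class below is a stub standing in for channel.Controller so the logic can run on its own, the exact handle_control() signature is an assumption, and the command names ('next', 'train', 'stop') are made up for the example; refer to lstm_controller.py for the real API.

```python
class Controller:
    """Stub standing in for platoon's channel.Controller (assumption)."""
    def init_control(self, port):
        self.port = port  # port the controller listens on
    def serve(self):
        pass  # the real method blocks, dispatching worker requests


class ExampleController(Controller):
    """Step 1: define the commands and how the controller responds."""
    def __init__(self, max_mb):
        self.uidx = 0        # minibatches processed across all workers
        self.max_mb = max_mb # stop training after this many minibatches

    def handle_control(self, req, worker_id):
        # Custom commands must not start with "platoon-" (reserved).
        if req == 'next':
            self.uidx += 1
            return 'stop' if self.uidx > self.max_mb else 'train'
        return 'invalid'


# Step 2: instantiate, bind the port, and (in real code) serve().
controller = ExampleController(max_mb=2)
controller.init_control(port=5567)
print([controller.handle_control('next', 0) for _ in range(3)])
# -> ['train', 'train', 'stop']
```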

Implementing the workers

These steps describe how to start with a script that performs stand-alone training of a machine learning model and adapt it to serve as a worker in Platoon.

  1. Add a new parameter to the script which will be used at execution time to know whether this worker is the first one to be launched and should therefore create the central parameters.

  2. Before entering the main loop, the script must create an instance of the class channel.Worker, providing it with the same port number as used to initialize the controller. It is not necessary to sub-class Worker; you can instantiate it directly. This object provides the necessary methods to handle communication with the controller.

  3. After the model has been built and the parameters initialized, initialize the central parameters by calling the Worker's init_shared_params() method. Every worker should call this method.

  4. In the main loop, instead of deciding when to train and when to monitor performance, the worker should send control requests to the controller to know which action it should take, according to the communication protocol established in the controller's handle_control() method.

  5. In the main loop, whenever the worker has performed N (a hyper-parameter) iterations of training, it should synchronize its parameters with the central parameters using its Worker's sync_params() method.
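The five steps above can be sketched as a skeleton main loop. This is a hedged illustration only: the Worker class below is a stub standing in for channel.Worker, and send_req() is a hypothetical name for the control-request method; check the LSTM example for the exact API and method names.

```python
class Worker:
    """Stub standing in for platoon's channel.Worker (assumption)."""
    def __init__(self, control_port):
        self.control_port = control_port  # same port as the controller
        self._calls = 0
    def init_shared_params(self, params):
        self.shared = list(params)  # central params (shared memory in real Platoon)
    def send_req(self, req):
        # Stand-in for the controller round-trip: 'train' four times, then 'stop'.
        self._calls += 1
        return 'train' if self._calls <= 4 else 'stop'
    def sync_params(self):
        pass  # the real method syncs local params with the central ones


params = [0.0]                      # model parameters, already initialized
worker = Worker(control_port=5567)  # step 2: same port as the controller
worker.init_shared_params(params)   # step 3: every worker calls this

N = 2        # step 5 hyper-parameter: iterations between synchronizations
done = 0
while True:
    action = worker.send_req('next')  # step 4: ask the controller what to do
    if action == 'stop':
        break
    done += 1                         # one training iteration would go here
    if done % N == 0:
        worker.sync_params()          # step 5: sync with central parameters
print(done)
# -> 4
```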

Real usage considerations

The optimal (as in most efficient for learning) hyper-parameter values depend on the number of workers. At minimum, consider tuning the learning rate and the alpha parameter of EASGD.

How to choose the alpha hyper-parameter isn't clear. An alpha of 0.5 for the LSTM example with 2 workers seems to give good training efficiency for this model/dataset/hyper-parameter combination.

Using alpha = 1/N (with N being the number of workers) might be a reasonable guideline, but the experiments performed with Platoon are insufficient to conclude anything.
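The 1/N guideline is trivial to encode; note that with N = 2 it reproduces the alpha = 0.5 that worked for the LSTM example. A one-line heuristic, to be treated as a starting point and not a recommendation:

```python
def easgd_alpha(n_workers):
    """Heuristic starting value for EASGD's alpha: 1/N workers."""
    return 1.0 / n_workers

print(easgd_alpha(2))
# -> 0.5
```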

In the EASGD paper it is shown that in some cases a larger number of workers can result in a better test error.

Examples

For the param_sync interface, see the example/lstm/ folder.

For the all_reduce interface, see the example/synchronous_lstm/ folder.
