MLCommons™ Algorithmic Efficiency
Paper (arXiv) • Installation • Rules • Contributing • License
MLCommons Algorithmic Efficiency is a benchmark and competition measuring neural network training speedups due to algorithmic improvements in both training algorithms and models. This repository holds the competition rules and the benchmark code to run it. For a detailed description of the benchmark design, see our paper.
Table of Contents
- Table of Contents
- AlgoPerf Benchmark Workloads
- Installation
- Getting Started
- Rules
- Contributing
- Citing AlgoPerf Benchmark
Installation
You can install this package and its dependencies in a Python virtual environment or use a Docker container (recommended).
TL;DR: to install the JAX version for GPU, run:
pip3 install -e '.[pytorch_cpu]'
pip3 install -e '.[jax_gpu]' -f 'https://storage.googleapis.com/jax-releases/jax_cuda_releases.html'
pip3 install -e '.[full]'
TL;DR: to install the PyTorch version for GPU, run:
pip3 install -e '.[jax_cpu]'
pip3 install -e '.[pytorch_gpu]' -f 'https://download.pytorch.org/whl/torch_stable.html'
pip3 install -e '.[full]'
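To sanity-check a GPU installation, you can ask the framework you installed to list its visible devices. These one-liners are generic checks, not part of the benchmark tooling; run whichever matches your installation:
python3 -c "import jax; print(jax.devices())"                # JAX: should list GPU devices
python3 -c "import torch; print(torch.cuda.is_available())"  # PyTorch: should print True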
Virtual environment
Note: Python minimum requirement >= 3.8
To set up a virtual environment and install this repository:
- Create a new environment, e.g. via conda or virtualenv (a conda alternative is sketched after this list):
  sudo apt-get install python3-venv
  python3 -m venv env
  source env/bin/activate
- Clone this repository:
  git clone https://github.com/mlcommons/algorithmic-efficiency.git
  cd algorithmic-efficiency
- Run the pip3 install commands above to install algorithmic_efficiency.
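If you prefer conda over venv, an equivalent environment can be created as follows (the environment name algoperf is just an illustrative choice; Python 3.8 matches the minimum requirement above):
conda create -n algoperf python=3.8
conda activate algoperf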
Additional Details
You can also install the requirements for individual workloads, e.g. via pip3 install -e '.[librispeech]', or for all workloads at once via pip3 install -e '.[full]'.
Docker
We recommend using a Docker container to ensure an environment similar to our scoring and testing environments.
Prerequisites for NVIDIA GPU setup: You may have to install the NVIDIA Container Toolkit so that containers can locate the NVIDIA drivers and GPUs. See the instructions here.
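To check that containers can access your GPUs after installing the toolkit, a common smoke test is to run nvidia-smi inside a CUDA base image (the image tag below is only an example, not something this repository provides):
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi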
Building Docker Image
- Clone this repository:
  cd ~ && git clone https://github.com/mlcommons/algorithmic-efficiency.git
- Build the Docker image:
  cd algorithmic-efficiency/docker
  docker build -t <docker_image_name> . --build-arg framework=<framework>
  The framework flag can be either pytorch, jax or both. The docker_image_name is arbitrary.
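For example, a JAX-only image could be built with (the image name algoperf_jax is just an illustrative choice):
docker build -t algoperf_jax . --build-arg framework=jax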
Running Docker Container (Interactive)
- Run a detached Docker container. This will print out a container ID.
  docker run -t -d \
    -v $HOME/data/:/data/ \
    -v $HOME/experiment_runs/:/experiment_runs \
    -v $HOME/experiment_runs/logs:/logs \
    -v $HOME/algorithmic-efficiency:/algorithmic-efficiency \
    --gpus all \
    --ipc=host \
    <docker_image_name>
- Open a bash terminal:
  docker exec -it <container_id> /bin/bash
Running Docker Container (End-to-end)
To run a submission end-to-end in a container, see the Getting Started document.
Getting Started
For instructions on developing and scoring your own algorithm in the benchmark, see the Getting Started document.
Running a workload
To run a submission directly inside a Docker container, see the Getting Started document.
Alternatively, from your virtual environment or inside an interactively running Docker container, run submission_runner.py as follows:
JAX
python3 submission_runner.py \
--framework=jax \
--workload=mnist \
--experiment_dir=$HOME/experiments \
--experiment_name=my_first_experiment \
--submission_path=reference_algorithms/development_algorithms/mnist/mnist_jax/submission.py \
--tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
PyTorch
python3 submission_runner.py \
--framework=pytorch \
--workload=mnist \
--experiment_dir=$HOME/experiments \
--experiment_name=my_first_experiment \
--submission_path=reference_algorithms/development_algorithms/mnist/mnist_pytorch/submission.py \
--tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
Using PyTorch DDP (Recommended)
When using multiple GPUs on a single node, it is recommended to use PyTorch's distributed data parallel (DDP).
To do so, simply replace python3 with
  torchrun --standalone --nnodes=1 --nproc_per_node=N_GPUS
where N_GPUS is the number of available GPUs on the node. To only see output from the first process, you can run the following to redirect the output from processes 1-7 to a log file:
  torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8
So the complete command is, for example:
torchrun --redirects 1:0,2:0,3:0,4:0,5:0,6:0,7:0 --standalone --nnodes=1 --nproc_per_node=8 \
submission_runner.py \
--framework=pytorch \
--workload=mnist \
--experiment_dir=/home/znado \
--experiment_name=baseline \
--submission_path=reference_algorithms/development_algorithms/mnist/mnist_pytorch/submission.py \
--tuning_search_space=reference_algorithms/development_algorithms/mnist/tuning_search_space.json
Rules
The rules for the MLCommons Algorithmic Efficiency benchmark can be found in the separate rules document. Suggestions, clarifications, and questions can be raised via pull requests.
Contributing
If you are interested in contributing to the work of the working group, feel free to join the weekly meetings or open issues. See our CONTRIBUTING.md for MLCommons contributing guidelines as well as setup and workflow instructions.
Note on shared data pipelines between JAX and PyTorch
The JAX and PyTorch versions of the Criteo, FastMRI, Librispeech, OGBG, and WMT workloads use the same TensorFlow input pipelines. Due to differences in how JAX and PyTorch distribute computations across devices, the PyTorch versions of these workloads incur additional overhead.
Since we use PyTorch's DistributedDataParallel implementation, there is one Python process for each device. Depending on the hardware and the settings of the cluster, running a TensorFlow input pipeline in each Python process can lead to errors, since too many threads are created in each process. See this PR thread for more details.
While this issue might not affect all setups, we currently implement a different strategy: we only run the TensorFlow input pipeline in one Python process (with rank == 0), and broadcast the batches to all other devices. This introduces an additional communication overhead for each batch. See the implementation for the WMT workload as an example.
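For illustration, the pattern roughly corresponds to the following minimal Python sketch. It assumes torch.distributed has already been initialized; the function name and arguments are hypothetical placeholders, not the actual AlgoPerf implementation (see the WMT workload for that).
import torch
import torch.distributed as dist

def broadcast_batch(batch, rank, device, batch_shape, dtype=torch.float32):
    # Hypothetical helper illustrating the rank-0 pipeline pattern above.
    # On rank 0, `batch` is the array produced by the TensorFlow input
    # pipeline; on all other ranks it is ignored and may be None.
    if rank == 0:
        # Only the rank-0 process ran the TensorFlow input pipeline.
        tensor = torch.as_tensor(batch, dtype=dtype, device=device)
    else:
        # Other ranks allocate an empty buffer with the matching shape.
        tensor = torch.empty(batch_shape, dtype=dtype, device=device)
    # Broadcast from rank 0 to every other process; this is the extra
    # per-batch communication overhead mentioned above.
    dist.broadcast(tensor, src=0)
    return tensor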