# Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads
This repository contains the source code implementation of the OSDI paper "Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads".
## Directory Structure
- `scheduler`: Code for the scheduler, including the scheduling mechanism and simulator (`scheduler.py`), implementations of performance-aware policies (`policies/`), the `GavelIterator` as a Python module, and a communication stack between the scheduler and workers that uses gRPC (`runtime/`). `scheduler/notebooks` contains parsing and plotting code to analyze experiment runs.
- `workloads`: Implementations of target workloads in PyTorch, including the changes needed to integrate with the `GavelIterator`; a sketch of what this integration looks like follows this list.
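As a rough illustration, the sketch below shows how a standard PyTorch training loop might wrap its data loader with the `GavelIterator`. The import path, the constructor arguments, and the `done`/`complete()` usage are assumptions made for illustration; `train_loader`, `num_epochs`, `train_step`, and `checkpoint_dir` stand in for the workload's existing training code. Refer to the actual code under `workloads/` for the real API.

```python
# Illustrative sketch only: the import path, constructor arguments, and the
# done/complete() calls are assumptions; see workloads/ for the real API.
from gavel_iterator import GavelIterator  # Gavel's iterator module (path assumed)

def load_checkpoint(checkpoint_path):
    # Workload-specific: restore model/optimizer state from checkpoint_path.
    pass

def save_checkpoint(checkpoint_path):
    # Workload-specific: persist model/optimizer state to checkpoint_path.
    pass

# train_loader, num_epochs, train_step, and checkpoint_dir come from the
# workload's existing training script; only the wrapping below is Gavel-specific.
train_loader = GavelIterator(train_loader, checkpoint_dir,
                             load_checkpoint, save_checkpoint)

for epoch in range(num_epochs):
    for inputs, targets in train_loader:
        train_step(inputs, targets)  # unchanged per-iteration training logic
    if train_loader.done:            # the scheduler revoked this job's lease
        break
train_loader.complete()             # notify the scheduler on clean exit
```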
## Setup
### Software Dependencies
Gavel is implemented in Python. We have tested Gavel on Ubuntu 16.04 with Python 3.8. Python 3.8 can be installed using Miniconda.
Required software dependencies can be installed using:

```bash
apt-get -y install cmake g++ gcc libnuma-dev make numactl zlib1g-dev
pip install -r scheduler/requirements.txt
cd scheduler; make
```
These software dependencies have already been installed on the following AMI on Amazon EC2:

| Field | Value |
|---|---|
| Cloud Provider | AWS |
| Region | us-east-1 |
| AMI ID | ami-03e41a79bb745ce18 |
| AMI Name | gavel |
See this link for how to find and launch a public AMI (this assumes you have a valid, billable AWS account set up).
## Getting Started
Gavel's heterogeneity-aware policies and scheduling mechanism can be evaluated either in simulation or on a physical cluster.
To evaluate variants of the LAS policy (`max_min_fairness*`) in simulation, one can use the following command line (this sweep script runs the different policies over multiple continuous traces, generated using different seeds and Poisson arrival rates):

```bash
python -u scripts/sweeps/run_sweep_continuous.py -s 4000 -e 5000 -l /path/to/log/directory -j 6 -p max_min_fairness max_min_fairness_perf --seeds 0 1 2 -c 36:36:36 -a 0.0 -b 1.0 -n 5
```
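Per the option descriptions below, this example measures jobs with IDs 4000 through 5000, runs six simulations in parallel, evaluates the `max_min_fairness` and `max_min_fairness_perf` policies with seeds 0, 1, and 2 on a cluster of 36 V100s, 36 P100s, and 36 K80s, and sweeps five input throughput values between 0.0 and 1.0.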
Other arguments for the `run_sweep_continuous.py` script are shown using the `-h` option:

```
usage: run_sweep_continuous.py [-h] [-l LOG_DIR] [-s WINDOW_START] [-e WINDOW_END] [-t TIMEOUT] [-j PROCESSES] [-p POLICIES [POLICIES ...]] [-c CLUSTER_SPEC [CLUSTER_SPEC ...]]
[--num_gpus_per_server NUM_GPUS_PER_SERVER] [--seeds SEEDS [SEEDS ...]] [-i INTERVAL] [-f FIXED_JOB_DURATION]
[--cutoff-throughputs-file CUTOFF_THROUGHPUTS_FILE] [--throughputs-file THROUGHPUTS_FILE] [-m] [--generate-multi-priority-jobs]
[--simulate-steady-state] [--solver {ECOS,GUROBI,SCS}] [-v] [--checkpoint-threshold CHECKPOINT_THRESHOLD]
[--profiling_percentages PROFILING_PERCENTAGES [PROFILING_PERCENTAGES ...]] [--num_reference_models NUM_REFERENCE_MODELS [NUM_REFERENCE_MODELS ...]]
[--ideal] [-a THROUGHPUT_LOWER_BOUND] [-b THROUGHPUT_UPPER_BOUND] [-n NUM_DATA_POINTS] [-u UTILIZATION_THRESHOLD]
Sweep through lambda values
optional arguments:
-h, --help show this help message and exit
-l LOG_DIR, --log-dir LOG_DIR
Log directory
-s WINDOW_START, --window-start WINDOW_START
Measurement window start (job ID)
-e WINDOW_END, --window-end WINDOW_END
Measurement window end (job ID)
-t TIMEOUT, --timeout TIMEOUT
Timeout (in seconds) for each run
-j PROCESSES, --processes PROCESSES
Number of processes to use in pool (use as many as available if not specified)
-p POLICIES [POLICIES ...], --policies POLICIES [POLICIES ...]
List of policies to sweep
-c CLUSTER_SPEC [CLUSTER_SPEC ...], --cluster-spec CLUSTER_SPEC [CLUSTER_SPEC ...]
Cluster specification in the form of #v100s:#p100s:#k80s
--num_gpus_per_server NUM_GPUS_PER_SERVER
Number of GPUs per server, in the form of #v100s:#p100s:#k80s
--seeds SEEDS [SEEDS ...]
List of random seeds
-i INTERVAL, --interval INTERVAL
Interval length (in seconds)
-f FIXED_JOB_DURATION, --fixed-job-duration FIXED_JOB_DURATION
If set, fixes the duration of all jobs to the specified value (in seconds)
--cutoff-throughputs-file CUTOFF_THROUGHPUTS_FILE
If set, uses the attached cutoff_throughputs JSON file in sweep to limit args run
--throughputs-file THROUGHPUTS_FILE
Oracle throughputs file
-m, --generate-multi-gpu-jobs
If set, generates multi-GPU jobs according to a pre-defined distribution
--generate-multi-priority-jobs
If set, generates some jobs with higher priority
--simulate-steady-state
If set, adds as many jobs as there are workers before beginning the simulation.
--solver {ECOS,GUROBI,SCS}
CVXPY solver
-v, --verbose Verbose
--checkpoint-threshold CHECKPOINT_THRESHOLD
Checkpoint threshold, None if checkpointing is disabled. Checkpoint is created after this job ID is added.
--profiling_percentages PROFILING_PERCENTAGES [PROFILING_PERCENTAGES ...]
Percentages of machines dedicated to profiling co-located job pairs
--num_reference_models NUM_REFERENCE_MODELS [NUM_REFERENCE_MODELS ...]
Number of reference models to use when estimating throughputs
--ideal Run allocations 100% ideally
Automatic sweep:
-u UTILIZATION_THRESHOLD, --utilization-threshold UTILIZATION_THRESHOLD
Utilization threshold to use when automatically sweeping lambdas
Sweep over fixed range:
-a THROUGHPUT_LOWER_BOUND, --throughput-lower-bound THROUGHPUT_LOWER_BOUND
Lower bound for throughput interval to sweep
-b THROUGHPUT_UPPER_BOUND, --throughput-upper-bound THROUGHPUT_UPPER_BOUND
Upper bound for throughput interval to sweep
-n NUM_DATA_POINTS, --num-data-points NUM_DATA_POINTS
Number of data points to sweep through
```
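To make the `-c`/`--cluster-spec` format concrete: the value encodes one GPU count per type, in the fixed order V100, P100, K80, so the earlier example `36:36:36` requests 36 GPUs of each type. The helper below is not part of the repository; it is a minimal sketch of how such a string can be parsed:

```python
def parse_cluster_spec(spec):
    """Parse a cluster spec of the form '#v100s:#p100s:#k80s',
    e.g. '36:36:36', into a dict mapping GPU type to count."""
    counts = [int(count) for count in spec.split(':')]
    if len(counts) != 3:
        raise ValueError('Expected three GPU counts, got: %s' % spec)
    return dict(zip(['v100', 'p100', 'k80'], counts))

# The sweep command above uses -c 36:36:36 (36 V100s, 36 P100s, 36 K80s).
assert parse_cluster_spec('36:36:36') == {'v100': 36, 'p100': 36, 'k80': 36}
```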
To evaluate policies on static traces (where all jobs are added to the cluster at the start of the trace), one can use the `scripts/sweeps/run_sweep_static.py` script, which runs different policies on multiple static traces, generated using different seeds and numbers of jobs.
For more detailed instructions on how to reproduce results from the OSDI paper, see [EXPERIMENTS.md](EXPERIMENTS.md).