
DAWNBench: An End-to-End Deep Learning Benchmark and Competition

DAWNBench Submission Instructions

Thank you for the interest in DAWNBench!

To add your model to our leaderboard, open a Pull Request with title <Model name> || <Task name> || <Author name> (example PR), with JSON (and TSV where applicable) result files in the format outlined below.

Tasks

CIFAR10 Training

Task Description

We evaluate image classification performance on the CIFAR10 dataset.

For training, we have two metrics:

  • Training Time: Train an image classification model for the CIFAR10 dataset. Report the time needed to train a model with test set accuracy of at least 94%
  • Cost: On public cloud infrastructure, compute the total time needed to reach a test set accuracy of 94% or greater, as outlined above. Multiply the time taken (in hours) by the cost of the instance per hour, to obtain the total cost of training the model

Including cost is optional and will only be calculated if the costPerHour field is included in the JSON file. Submissions that only aim for time aren't restricted to public cloud infrastructure.
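The cost metric above is a single multiplication. As a minimal sketch, assuming a hypothetical helper named training_cost_usd (not part of the benchmark tooling) and made-up input values:

```python
def training_cost_usd(hours_to_target: float, cost_per_hour_usd: float) -> float:
    """Training cost: hours needed to reach the accuracy target times the hourly instance price."""
    if hours_to_target < 0 or cost_per_hour_usd < 0:
        raise ValueError("hours and cost per hour must be non-negative")
    return hours_to_target * cost_per_hour_usd


# Hypothetical run: 2.5 hours on an instance billed at 0.90 USD per hour.
print(training_cost_usd(2.5, 0.90))  # 2.25
```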

JSON Format

Results for the CIFAR10 training tasks can be reported using a JSON file with the following fields,

  • version: DAWNBench competition version (currently v1.0)
  • author: Author name
  • authorEmail: Author email
  • framework: Framework on which training / inference was performed
  • codeURL: [Optional] URL pointing to code for model
  • model: Model name
  • hardware: A short description of the hardware on which model training was performed. If relevant, please specify Cloud provider and instance type to make results more reproducible
  • costPerHour: [Optional] Reported in USD ($). Cost of instance per hour
  • timestamp: Date of submission in format yyyy-mm-dd
  • logFilename: [Optional] URL pointing to training logs
  • misc: [Optional] JSON object of other miscellaneous notes, such as learning rate schedule, optimization algorithm, framework version, etc.

In addition, report training progress at the end of every epoch in a TSV with the following format,

epoch\thours\ttop1Accuracy

We will compute time to reach a test set accuracy of 94% by reading off the first entry in the above TSV with a top-1 test set accuracy of at least 94%.
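The rule above can be sketched as follows. This is an illustration rather than the scoring code the maintainers run; the function name hours_to_threshold and the sample log are assumptions made for the example.

```python
import csv
import io
from typing import Optional


def hours_to_threshold(tsv_text: str, column: str = "top1Accuracy",
                       threshold: float = 94.0) -> Optional[float]:
    """Return `hours` from the first row whose `column` is >= threshold, or None if no row qualifies."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if float(row[column]) >= threshold:
            return float(row["hours"])
    return None


# Made-up log: epoch 2 is the first entry at or above 94%, even though epoch 3 dips below it.
sample = "epoch\thours\ttop1Accuracy\n1\t0.5\t90.1\n2\t1.0\t94.2\n3\t1.5\t93.8\n"
print(hours_to_threshold(sample))  # 1.0
```

Only the first qualifying row counts, so a later drop in accuracy does not change the reported time.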

JSON and TSV files are named [author name]_[model name]_[hardware tag]_[framework].[json|tsv], similar to dawn_resnet56_1k80-gc_tensorflow.[json|tsv]. Put the JSON and TSV files in the CIFAR10/train/ sub-directory.

Example JSON and TSV

JSON

{
    "version": "v1.0",
    "author": "Stanford DAWN",
    "authorEmail": "[email protected]",
    "framework": "TensorFlow",
    "codeURL": "https://github.com/stanford-futuredata/dawn-benchmark/tree/master/tensorflow",
    "model": "ResNet 56",
    "hardware": "1 K80 / 30 GB / 8 CPU (Google Cloud)",
    "costPerHour": 0.90,
    "timestamp": "2017-08-14",
    "misc": {}
}

TSV

epoch   hours top1Accuracy
1       0.07166666666666667     33.57
2       0.1461111111111111      52.51
3       0.21805555555555556     61.71
4       0.2902777777777778      69.46
5       0.3622222222222222      71.47
6       0.43416666666666665     69.64
7       0.5061111111111111      75.81

CIFAR10 Inference

Task Description

We evaluate image classification performance on the CIFAR10 dataset.

For inference, we have two metrics:

  • Latency: Use a model that has a test set accuracy of 94% or greater. Measure the total time needed to classify all 10,000 images in the CIFAR10 test set one-at-a-time, and then divide by 10,000
  • Cost: Use a model that has a test set accuracy of 94% or greater. Measure the average per-image latency in the CIFAR10 test set, and then multiply by the cost of the instance per unit time
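One way to measure the latency metric above is to time a loop that classifies one image per call. The sketch below assumes a hypothetical classify callable standing in for a real model. With frameworks that run work asynchronously on an accelerator, the callable would also need to wait for the result before returning, otherwise the timing is too optimistic.

```python
import time
from typing import Any, Callable, Iterable


def mean_latency_ms(classify: Callable[[Any], Any], images: Iterable[Any]) -> float:
    """Classify images one at a time; return total wall-clock time divided by the image count, in ms."""
    count = 0
    start = time.perf_counter()
    for image in images:
        classify(image)  # one image per call, i.e. a batch size of one
        count += 1
    elapsed_s = time.perf_counter() - start
    if count == 0:
        raise ValueError("no images to classify")
    return 1000.0 * elapsed_s / count


# Stand-in "model" and data, for illustration only.
print(mean_latency_ms(lambda image: sum(image), [[0.0] * 3072 for _ in range(100)]))
```

Passing an already-loaded list keeps data loading out of the timed loop; whether loading should be counted is not specified above.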

JSON Format

Results for the CIFAR10 inference tasks can be reported using a JSON file with the following fields,

  • version: DAWNBench competition version (currently v1.0)
  • author: Author name
  • authorEmail: Author email
  • framework: Framework on which training / inference was performed
  • codeURL: [Optional] URL pointing to code for model
  • model: Model name
  • hardware: A short description of the hardware on which model inference was performed. If relevant, please specify Cloud provider and instance type to make results more reproducible
  • latency: Reported in milliseconds. Time needed to classify one image
  • cost: Reported in USD ($). Cost of performing inference on a single image. Computed as costPerHour * latency, with latency converted from milliseconds to hours
  • top1Accuracy: Reported in percentage points from 0 to 100. Accuracy of model on CIFAR10 test dataset.
  • timestamp: Date of submission in format yyyy-mm-dd
  • logFilename: [Optional] URL pointing to training / inference logs
  • misc: [Optional] JSON object of other miscellaneous notes, such as batch size, framework version, etc.

Note that it is only necessary to specify one of the latency and cost fields outlined above. However, it is encouraged to specify both (if available) in a single JSON result file.
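Because latency is reported in milliseconds and instance prices are hourly, the cost field needs a unit conversion. A minimal sketch, assuming a hypothetical helper name and made-up input values:

```python
MS_PER_HOUR = 3_600_000  # 1000 ms per second * 3600 seconds per hour


def inference_cost_usd(latency_ms: float, cost_per_hour_usd: float) -> float:
    """Per-item inference cost: hourly instance price times latency expressed in hours."""
    return cost_per_hour_usd * latency_ms / MS_PER_HOUR


# Hypothetical: 20 ms per image on an instance billed at 0.90 USD per hour,
# which comes to about 5e-06 USD per image.
print(inference_cost_usd(20.0, 0.90))
```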

JSON files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to dawn_resnet56_1k80-gc_tensorflow.json. Put the JSON file in the CIFAR10/inference/ sub-directory.

Example JSON

{
    "version": "v1.0",
    "author": "Stanford DAWN",
    "authorEmail": "[email protected]",
    "framework": "TensorFlow",
    "codeURL": "https://github.com/stanford-futuredata/dawn-benchmark/tree/master/tensorflow",
    "model": "ResNet 56",
    "hardware": "1 K80 / 30 GB / 8 CPU (Google Cloud)",
    "latency": 43.45,
    "cost": 1e-6,
    "top1Accuracy": 94.45,
    "timestamp": "2017-08-14",
    "misc": {}
}

ImageNet Training

Task Description

We evaluate image classification performance on the ImageNet dataset.

For training, we have two metrics:

  • Training Time: Train an image classification model for the ImageNet dataset. Report the time needed to train a model with top-5 validation accuracy of at least 93%
  • Cost: On public cloud infrastructure, compute the total time needed to reach a top-5 validation accuracy of 93% or greater, as outlined above. Multiply the time taken (in hours) by the cost of the instance per hour, to obtain the total cost of training the model

Including cost is optional and will only be calculated if the costPerHour field is included in the JSON file. Submissions that only aim for time aren't restricted to public cloud infrastructure.

JSON Format

Results for the ImageNet training tasks can be reported using a JSON file with the following fields,

  • version: DAWNBench competition version (currently v1.0)
  • author: Author name
  • authorEmail: Author email
  • framework: Framework on which training / inference was performed
  • codeURL: [Optional] URL pointing to code for model
  • model: Model name
  • hardware: A short description of the hardware on which model training was performed. If relevant, please specify Cloud provider and instance type to make results more reproducible
  • costPerHour: [Optional] Reported in USD ($). Cost of instance per hour
  • timestamp: Date of submission in format yyyy-mm-dd
  • logFilename: [Optional] URL pointing to training logs
  • misc: [Optional] JSON object of other miscellaneous notes, such as learning rate schedule, optimization algorithm, framework version, etc.

In addition, report training progress at the end of every epoch in a TSV with the following format,

epoch\thours\ttop1Accuracy\ttop5Accuracy

We will compute time to reach a top-5 validation accuracy of 93% by reading off the first entry in the above TSV with a top-5 validation accuracy of at least 93%.
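The four-column progress file described above could be written with Python's csv module. This is a sketch under assumptions: the helper name write_progress_tsv and the sample row are invented for illustration, and real training code would append one row at the end of each epoch.

```python
import csv
import io

FIELDS = ["epoch", "hours", "top1Accuracy", "top5Accuracy"]


def write_progress_tsv(rows, out) -> None:
    """Write a header line followed by one tab-separated line per epoch."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)


# Made-up epoch record; in practice `out` would be an open file.
buf = io.StringIO()
write_progress_tsv(
    [{"epoch": 1, "hours": 0.25, "top1Accuracy": 30.0, "top5Accuracy": 55.5}], buf
)
print(buf.getvalue(), end="")
```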

JSON and TSV files are named [author name]_[model name]_[hardware tag]_[framework].[json|tsv], similar to dawn_resnet56_1k80-gc_tensorflow.[json|tsv]. Put the JSON and TSV files in the ImageNet/train/ sub-directory.

Example JSON and TSV

JSON

{
    "version": "v1.0",
    "author": "Stanford DAWN",
    "authorEmail": "[email protected]",
    "framework": "TensorFlow",
    "codeURL": "https://github.com/stanford-futuredata/dawn-benchmark/tree/master/tensorflow",
    "model": "ResNet 50",
    "hardware": "1 K80 / 30 GB / 8 CPU (Google Cloud)",
    "costPerHour": 0.90,
    "timestamp": "2017-08-14",
    "misc": {}
}

TSV

epoch   hours top1Accuracy top5Accuracy
1       0.07166666666666667     33.57     68.93
2       0.1461111111111111      52.51     72.48
3       0.21805555555555556     61.71     81.46
4       0.2902777777777778      69.46     81.92
5       0.3622222222222222      71.47     82.17
6       0.43416666666666665     69.64     83.68
7       0.5061111111111111      75.81     84.31

ImageNet Inference

Task Description

We evaluate image classification performance on the ImageNet dataset.

For inference, we have two metrics:

  • Latency: Use a model that has a top-5 validation accuracy of 93% or greater. Measure the total time needed to classify all 50,000 images in the ImageNet validation set one-at-a-time, and then divide by 50,000
  • Cost: Use a model that has a top-5 validation accuracy of 93% or greater. Measure the average latency of performing inference on a single image (as described above), then multiply by the cost of the instance per unit time to get the cost of performing inference on a single image

JSON Format

Results for the ImageNet inference tasks can be reported using a JSON file with the following fields,

  • version: DAWNBench competition version (currently v1.0)
  • author: Author name
  • authorEmail: Author email
  • framework: Framework on which training / inference was performed
  • codeURL: [Optional] URL pointing to code for model
  • model: Model name
  • hardware: A short description of the hardware on which model inference was performed. If relevant, please specify Cloud provider and instance type to make results more reproducible
  • latency: Reported in milliseconds. Time needed to classify one image
  • cost: Reported in USD ($). Cost of performing inference on a single image. Computed as costPerHour * latency, with latency converted from milliseconds to hours
  • top5Accuracy: Reported in percentage points from 0 to 100. Top-5 accuracy of model on the ImageNet validation set.
  • timestamp: Date of submission in format yyyy-mm-dd
  • logFilename: [Optional] URL pointing to training / inference logs
  • misc: [Optional] JSON object of other miscellaneous notes, such as batch size, framework version, etc.

Note that it is only necessary to specify one of the latency and cost fields outlined above. However, it is encouraged to specify both (if available) in a single JSON result file.
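A result file can be checked against the field list above before opening a pull request. The sketch below is not the repository's unit test suite. It assumes that every field not marked [Optional] is required, and the function name check_inference_result is invented for the example.

```python
import json

# Assumption: fields not marked [Optional] in the list above are required.
REQUIRED = ["version", "author", "authorEmail", "framework", "model",
            "hardware", "top5Accuracy", "timestamp"]


def check_inference_result(text: str) -> list:
    """Return a list of problems found; an empty list means these basic checks passed."""
    result = json.loads(text)
    problems = [f"missing field: {name}" for name in REQUIRED if name not in result]
    if "latency" not in result and "cost" not in result:
        problems.append("need at least one of: latency, cost")
    return problems


# A deliberately incomplete file, to show what gets reported.
print(check_inference_result('{"version": "v1.0", "latency": 43.45}'))
```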

JSON files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to dawn_resnet56_1k80-gc_tensorflow.json. Put the JSON file in the ImageNet/inference/ sub-directory.

Example JSON

{
    "version": "v1.0",
    "author": "Stanford DAWN",
    "authorEmail": "[email protected]",
    "framework": "TensorFlow",
    "codeURL": "https://github.com/stanford-futuredata/dawn-benchmark/tree/master/tensorflow",
    "model": "ResNet 50",
    "hardware": "1 K80 / 30 GB / 8 CPU (Google Cloud)",
    "latency": 43.45,
    "cost": 4.27e-6,
    "top5Accuracy": 93.45,
    "timestamp": "2017-08-14",
    "misc": {}
}

SQuAD Training

Task Description

We evaluate question answering performance on the SQuAD dataset.

For training, we have two metrics:

  • Training Time: Train a question answering model for the SQuAD dataset. Report the time needed to train a model with a dev set F1 score of at least 0.73
  • Cost: On public cloud infrastructure, compute the total time needed to reach a dev set F1 score of 0.73 or greater, as outlined above. Multiply the time taken (in hours) by the cost of the instance per hour, to obtain the total cost of training the model

Including cost is optional and will only be calculated if the costPerHour field is included in the JSON file. Submissions that only aim for time aren't restricted to public cloud infrastructure.

JSON Format

Results for the SQuAD training tasks can be reported using a JSON file with the following fields,

  • version: DAWNBench competition version (currently v1.0)
  • author: Author name
  • authorEmail: Author email
  • framework: Framework on which training / inference was performed
  • codeURL: [Optional] URL pointing to code for model
  • model: Model name
  • hardware: A short description of the hardware on which model training was performed. If relevant, please specify Cloud provider and instance type to make results more reproducible
  • costPerHour: [Optional] Reported in USD ($). Cost of instance per hour
  • timestamp: Date of submission in format yyyy-mm-dd
  • logFilename: [Optional] URL pointing to training / inference logs
  • misc: [Optional] JSON object of other miscellaneous notes, such as learning rate schedule, optimization algorithm, framework version, etc.

In addition, report training progress at the end of every epoch in a TSV with the following format,

epoch\thours\tf1Score

We will compute time to reach an F1 score of 0.73 by reading off the first entry in the above TSV with an F1 score of at least 0.73.

JSON and TSV files are named [author name]_[model name]_[hardware tag]_[framework].[json|tsv], similar to dawn_bidaf_1k80-gc_tensorflow.[json|tsv]. Put the JSON and TSV files in the SQuAD/train/ sub-directory.

Example JSON and TSV

JSON

{
    "version": "v1.0",
    "author": "Stanford DAWN",
    "authorEmail": "[email protected]",
    "framework": "TensorFlow",
    "codeURL": "https://github.com/stanford-futuredata/dawn-benchmark/tree/master/tensorflow_qa/bi-att-flow",
    "model": "BiDAF",
    "hardware": "1 K80 / 30 GB / 8 CPU (Google Cloud)",
    "costPerHour": 0.90,
    "timestamp": "2017-08-14",
    "misc": {}
}

TSV

epoch   hours f1Score
1     0.7638888888888888      0.5369029640999999
2     1.5238381055555557      0.6606892943
3     2.2855751       0.700419426
4     3.0448481305555557      0.7229908705
5     3.806446388888889       0.731013
6     4.5750864       0.7370445132
7     5.346703258333334       0.7413719296
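As a worked example, the snippet below applies the 0.73 threshold to the rows of the example TSV above and prices the result with the costPerHour from the example JSON. The variable names are chosen for illustration only.

```python
# Rows copied from the example TSV above: (epoch, hours, f1Score).
rows = [
    (1, 0.7638888888888888, 0.5369029640999999),
    (2, 1.5238381055555557, 0.6606892943),
    (3, 2.2855751, 0.700419426),
    (4, 3.0448481305555557, 0.7229908705),
    (5, 3.806446388888889, 0.731013),
    (6, 4.5750864, 0.7370445132),
    (7, 5.346703258333334, 0.7413719296),
]
TARGET_F1 = 0.73
COST_PER_HOUR_USD = 0.90  # costPerHour from the example JSON

# First entry whose F1 score is at least the target.
epoch, hours, f1 = next(row for row in rows if row[2] >= TARGET_F1)
print(epoch)                                # 5
print(f"{hours:.3f}")                       # 3.806
print(f"{hours * COST_PER_HOUR_USD:.2f}")   # 3.43
```

Epoch 4 falls just short at about 0.723, so the reported time is the 3.806 hours logged at epoch 5, for a training cost of about 3.43 USD.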

SQuAD Inference

Task Description

We evaluate question answering performance on the SQuAD dataset.

For inference, we have two metrics:

  • Latency: Use a model that has a dev set F1 measure of 0.73 or greater. Measure the total time needed to answer all questions in the SQuAD dev set one-at-a-time, and then divide by the number of questions
  • Cost: Use a model that has a dev set F1 measure of 0.73 or greater. Measure the average latency needed to perform inference on a single question, and then multiply by the cost of the instance per unit time

JSON Format

Results for the SQuAD inference tasks can be reported using a JSON file with the following fields,

  • version: DAWNBench competition version (currently v1.0)
  • author: Author name
  • authorEmail: Author email
  • framework: Framework on which training / inference was performed
  • codeURL: [Optional] URL pointing to code for model
  • model: Model name
  • hardware: A short description of the hardware on which model inference was performed. If relevant, please specify Cloud provider and instance type to make results more reproducible
  • latency: Reported in milliseconds. Time needed to answer one question
  • cost: Reported in USD ($). Cost of performing inference on a single question. Computed as costPerHour * latency, with latency converted from milliseconds to hours
  • f1Score: Reported in fraction from 0.0 to 1.0. F1 score of model on SQuAD development dataset
  • timestamp: Date of submission in format yyyy-mm-dd
  • logFilename: [Optional] URL pointing to training / inference logs
  • misc: [Optional] JSON object of other miscellaneous notes, such as batch size, framework version, etc.

Note that it is only necessary to specify one of the latency and cost fields outlined above. However, it is encouraged to specify both (if available) in a single JSON result file.

JSON files are named [author name]_[model name]_[hardware tag]_[framework].json, similar to dawn_bidaf_1k80-gc_tensorflow.json. Put the JSON file in the SQuAD/inference/ sub-directory.

Example JSON

{
    "version": "v1.0",
    "author": "Stanford DAWN",
    "authorEmail": "[email protected]",
    "framework": "TensorFlow",
    "codeURL": "https://github.com/stanford-futuredata/dawn-benchmark/tree/master/tensorflow_qa/bi-att-flow",
    "model": "BiDAF",
    "hardware": "1 K80 / 30 GB / 8 CPU (Google Cloud)",
    "latency": 590.0,
    "cost": 2e-6,
    "f1Score": 0.7524165510999999,
    "timestamp": "2017-08-14",
    "misc": {}
}

FAQ

  • Can spot instances be used for cost metrics? For submissions including cost, please use on-demand, i.e., non-preemptible, instance pricing. Spot pricing is too volatile for the current release of the benchmark. We're open to suggestions on better ways to deal with pricing volatility, so if you have ideas, please pitch them on the Google group.
  • Is validation time included in training time? No, you don't need to include the time required to calculate validation accuracy and save checkpoints.
  • What happens after I submit a pull request with a new result? After you submit a PR, unit tests should run automatically to check basic requirements. Assuming the unit tests pass, we review the code and the submission. If it is sufficiently similar to existing results or the difference is easily justified, we accept the submission without reproducing it. If there are issues with the code or someone questions the results, the process is a little more complicated and can vary from situation to situation. If the issues are small, it may be as simple as changing the JSON file.
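Since validation and checkpointing time need not be counted, the hours column can come from a clock that is paused during those phases. A minimal sketch, assuming a hypothetical TrainingClock class that is not part of the benchmark code:

```python
import time


class TrainingClock:
    """Accumulates wall-clock time only while started, so validation and checkpointing can be excluded."""

    def __init__(self) -> None:
        self._elapsed_s = 0.0
        self._started_at = None

    def start(self) -> None:
        if self._started_at is None:
            self._started_at = time.perf_counter()

    def pause(self) -> None:
        if self._started_at is not None:
            self._elapsed_s += time.perf_counter() - self._started_at
            self._started_at = None

    @property
    def hours(self) -> float:
        running = 0.0 if self._started_at is None else time.perf_counter() - self._started_at
        return (self._elapsed_s + running) / 3600.0


clock = TrainingClock()
clock.start()
# ... train for one epoch here ...
clock.pause()
# ... compute validation accuracy and save a checkpoint here (not timed) ...
print(clock.hours)  # value to log in the `hours` column for this epoch
```

Calling start() again at the beginning of the next epoch resumes the same running total.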

Disclosure: The Stanford DAWN research project is a five-year industrial affiliates program at Stanford University and is financially supported in part by founding members including Intel, Microsoft, NEC, Teradata, VMware, and Google. For more information, including information regarding Stanford’s policies on openness in research and policies affecting industrial affiliates program membership, please see DAWN's membership page.
