TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs

TpuGraphs is a performance prediction dataset on full tensor programs, represented as computational graphs, running on Tensor Processing Units (TPUs). Each graph in the dataset represents the main computation of a machine learning workload, e.g., a training epoch or an inference step. Each data sample contains a computational graph, a compilation configuration, and the execution time of the graph when compiled with the configuration. The graphs in the dataset are collected from open-source machine learning programs, featuring popular model architectures (e.g., ResNet, EfficientNet, Mask R-CNN, and Transformer).

Please refer to our paper for more details about the importance and challenges of the dataset, how the dataset is generated, the model baselines, and the experimental results. If you find this dataset useful in your research, please cite our paper as:

@inproceedings{tpugraphs,
  title={TpuGraphs: A Performance Prediction Dataset on Large Tensor Computational Graphs},
  author={Phitchaya Mangpo Phothilimthana and Sami Abu-El-Haija and Kaidi Cao and Bahare Fatemi and Michael Burrows and Charith Mendis and Bryan Perozzi},
  booktitle={Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2023},
  url={https://openreview.net/forum?id=plAix1NxhU}
}

This is not an officially supported Google product.

Dataset

The dataset consists of two compiler optimization collections: layout and tile. Layout configurations control how tensors are laid out in physical memory by specifying the dimension order of each input and output of an operation node. A tile configuration controls the tile size of each fused subgraph.

The dataset is located at http://download.tensorflow.org/data/tpu_graphs/v0. You can use the wget or curl command to download the files.

To download all files, please follow one of these options:

  1. Download http://download.tensorflow.org/data/tpu_graphs/v0/npz_all.tar, preferably to ~/data/tpugraphs (our training pipelines read from there), then untar it with tar xvf npz_all.tar. This can be done with the following bash commands:
mkdir -p ~/data/tpugraphs
cd ~/data/tpugraphs
curl http://download.tensorflow.org/data/tpu_graphs/v0/npz_all.tar > npz_all.tar
tar xvf npz_all.tar
  2. Download from Kaggle.

  3. Use our helper script echo_download_commands.py. From a clone of this repository, run:

python3 echo_download_commands.py | bash

Removing the last pipe (| bash) shows the commands for downloading the dataset (a few curl commands followed by tar xvf).

To download the {train, test, validation} data for a layout collection, e.g., layout:xla:random, run:

mkdir -p ~/data/tpugraphs
cd ~/data/tpugraphs

curl http://download.tensorflow.org/data/tpu_graphs/v0/npz_layout_xla_random_train.tar > npz_layout_xla_random_train.tar
curl http://download.tensorflow.org/data/tpu_graphs/v0/npz_layout_xla_random_valid.tar > npz_layout_xla_random_valid.tar
curl http://download.tensorflow.org/data/tpu_graphs/v0/npz_layout_xla_random_test.tar > npz_layout_xla_random_test.tar
tar xvf npz_layout_xla_random_train.tar
tar xvf npz_layout_xla_random_valid.tar
tar xvf npz_layout_xla_random_test.tar

For a description of these files, you may scroll down to "Dataset File Description".

Running Baseline Models

This repo hosts two training pipelines: tiles_train.py and layout_train.py, for training models on the tile:xla and layout:{nlp|xla}:{random|default} collections, respectively. Both scripts train for a number of epochs and then run inference on the test set. By default, the trained models are saved along with a CSV file containing the test-set predictions.

To combine all five inference files into one CSV to submit to our Kaggle competition, run:

python combine_csvs.py

NOTE: The above command will look for the files produced by tiles_train.py and layout_train.py.

Python environment setup with Conda

conda create -n tpugraphs python=3.10
conda activate tpugraphs

conda install -c conda-forge tensorflow
conda install -c conda-forge tqdm

pip install tensorflow_gnn --pre
pip install tensorflow-ranking
conda clean --all

For subsequent runs, simply activate the same environment with conda activate tpugraphs.

Model on tile:xla collection

Train model

The following command will train a GraphSAGE model with the early join of config features on a small subset of data:

python tiles_train.py --model=EarlyJoinSAGE --toy_data=True

To train on the full dataset, run:

python tiles_train.py --model=EarlyJoinSAGE

The current code supports training on a CPU. Once the training is done, it will produce a jsonz file with the prefix "run_". This file will contain the overall top-K errors (see the definition in the paper) on kernels in the validation set. To view the result:

zcat run_xxx.jsonz > run_xxx.json

Search for:

"final_error": {"val": {"1": <top-1 error>, "5": <top-5 error>, "10": <top-10 error>}}

where 0.2 error means 20% error.
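
Equivalently, here is a minimal Python sketch for reading the result file directly. This assumes the .jsonz file is gzip-compressed JSON, as the zcat usage above suggests; the filename is a placeholder.

import gzip
import json

# Placeholder filename; use the actual run_*.jsonz produced by tiles_train.py.
with gzip.open("run_xxx.jsonz", "rt") as f:
    results = json.load(f)

# Overall top-K errors on the validation set (e.g., 0.2 means 20% error).
print(results["final_error"]["val"])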

Further, the training code will output a .csv file containing the top-5 configuration rankings over the test set. By default, the CSV will be written to:

~/out/tpugraphs_tiles/results_<timestamp>.csv

The path can be overridden with flags --out_dir and --results_csv. Please refer to train_args.py for a list of flags.

Sweep hyperparameters

Run Apache Beam locally (for debugging):

python tiles_beam_experiments.py --debug

To run the pipeline on Google Cloud, please follow these instructions.

Evaluate model

To evaluate models trained on the tile collection, look for the model directory (the training pipeline's --out_dir flag defaults to ~/out/tpugraphs_tiles), which should start with the prefix model_. To evaluate one or more models, run:

python tiles_evaluate.py --dirs <comma-separated list of model dirs>

This script will print out per-program top-K errors for kernels in the validation set in the following format:

{
  "K": {  # top-K error
    <program> : <error>,
    ...
  },
  ...
}

Currently, the evaluation script does not produce the ranking .csv file.

Model on layout:{xla|nlp}:{random|default} collections

We provide a baseline model for the layout collections in this repo. You can train it by invoking:

# As a test.
python layout_train.py --epochs 10 --toy_data=True

# On xla:random
python layout_train.py --source xla --search random --epochs 10 --max_configs 1000

# On xla:default
python layout_train.py --source xla --search default --epochs 10 --max_configs 1000

# On nlp:random
python layout_train.py --source nlp --search random --epochs 10 --max_configs 1000

# On nlp:default
python layout_train.py --source nlp --search default --epochs 10 --max_configs 1000

NOTE: The NLP collections are large, so our trainer script cannot fit the full data into memory. The flag --max_configs 1000 makes training feasible by sampling only that many configurations per graph. You may write your own scalable implementation, or modify ours. A rough sketch of what this per-graph sampling amounts to is shown below.
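
The following is our own illustration of the sampling idea using the documented .npz keys; it is not the trainer's actual code.

import numpy as np

def sample_configs(d, max_configs=1000, seed=0):
    # Keep at most max_configs randomly chosen configurations of a layout sample.
    rng = np.random.default_rng(seed)
    c = d["config_runtime"].shape[0]
    keep = rng.choice(c, size=min(c, max_configs), replace=False)
    d["node_config_feat"] = d["node_config_feat"][keep]
    d["config_runtime"] = d["config_runtime"][keep]
    return d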

For an alternative implementation (that uses PyTorch), you may view our collaborators' Graph Segmented Training implementation (GST) at https://github.com/kaidic/GST.

Each (complete) invocation of python layout_train.py trains the model and then runs inference on the test set. The inference step produces a ranking .csv file, by default at:

~/out/tpugraphs_layout/results_<timestamp>_<source>_<search>.csv

Example: ~/out/tpugraphs_layout/results_1693169615975_xla_default.csv.

NOTE: You can run python combine_csvs.py to produce the final CSV that can be submitted to our Kaggle competition. The tool requires five input CSV files corresponding to the collections tile:xla, layout:xla:default, layout:xla:random, layout:nlp:default, and layout:nlp:random. You may specify them as flag arguments. By default, combine_csvs.py picks the files with the most recent timestamps, searching in the default directories produced by the training pipelines (i.e., ~/out/tpugraphs_layout for layout_train.py and ~/out/tpugraphs_tiles for tiles_train.py).

Evaluate model

To evaluate models on the validation set of layout collections, please refer to tpu_graphs/evals.

Dataset File Description

Tiles Collection .npz files

We provide our dataset as .npz files. Download instructions are in the "Dataset" section above.

Suppose a .npz file stores a graph (representing a kernel) with n nodes and m edges. In addition, suppose we compile the graph with c different configurations, and run each on a TPU. Crucially, the configuration is at the graph-level. Then, the .npz file stores the following dictionary (can be loaded with d = dict(np.load("npz/tile/xla/train/<pick 1>.npz"))):

  • Key "node_feat": contains float32 matrix with shape (n, 140). The uth row contains the feature vector for node u < n (please see Subsection "Node Features", below). Nodes are ordered topologically.
  • Key "node_opcode" contains int32 vector with shape (n, ). The uth entry stores the op-code for node u (please see the mapping of opcode to instruction name here).
  • Key "edge_index" contains int32 matrix with shape (m, 2). If entry i is = [u, v] (where 0 <= u, v < n), then there is a directed edge from node u to node v, where u consumes the output of v.
  • Key "config_feat" contains float32 matrix with shape (c, 24) with row j containing the (graph-level) configuration feature vector (please see Subsection "Tile Config Features").
  • Keys "config_runtime" and "config_runtime_normalizers": both are int64 vectors of length c. Entry j stores the runtime (in nanoseconds) of the given graph compiled with configuration j and a default configuration, respectively. Samples from the same graph may have slightly different "config_runtime_normalizers" because they are measured from different runs on multiple machines.

Finally, for the tile collection, your job is to predict the indices of the best configurations (i.e., ones leading to the smallest d["config_runtime"] / d["config_runtime_normalizers"]).
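
For illustration, here is a minimal sketch of loading a tile sample and ranking its configurations by normalized runtime; the file path is a placeholder, and this is not part of the training pipeline.

import numpy as np

# Placeholder path; pick any sample from the tile collection.
d = dict(np.load("npz/tile/xla/train/<pick 1>.npz"))

# Normalize each configuration's runtime by its default-configuration runtime.
normalized = d["config_runtime"] / d["config_runtime_normalizers"]

# Configuration indices sorted from best (smallest normalized runtime) to worst.
best_to_worst = np.argsort(normalized)
print("best configuration index:", best_to_worst[0])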

Layout Collections .npz files

Suppose a .npz file stores a graph (representing the entire program) with n nodes and m edges. In addition, suppose we compile the graph with c different configurations, and run each on a TPU. Crucially, the configuration is at the node level. Suppose that nc of the n nodes are configurable. Then, the .npz file stores the following dictionary (can be loaded with, e.g., d = dict(np.load("npz/layout/xla/random/train/unet3d.npz"))):

  • Keys "node_feat", "node_opcode", "edge_index", are like above.
  • Key "node_config_ids" contains int32 vector with shape (nc, ) and every entry is in {0, 1, ..., n - 1} i.e. indicating the indices of the configurable nodes. For these nodes, they can have an additional feature vector that instructs the compiler (described next).
  • Key "node_config_feat" contains float32 tensor with shape (c, nc, 18). Entry [j, k] gives an 18-dimensional vector describing the configuration features for node d["node_config_ids"][k] for the jth run (please see Subsection "Layout Config Features", below).
  • Key "config_runtime" contains int32 vector with shape (c, ) where the jth entry contains the runtime of the jth run (i.e., when nodes are configured with d["node_config_feat"][j]).

Finally, for the layout collections, your job is to sort the configuration indices from best to worst (i.e., from the smallest to the largest d["config_runtime"]). We do not have to use runtime normalizers for this task because the runtime variation at the entire-program level is very small.

Optionally, you may access key "node_splits", a variable-length list of node IDs marking the start of each HLO computation in the graph (similar to functions in a program). Essentially, nodes d["node_splits"][i] to d["node_splits"][i+1] - 1 belong to the same computation. If you want to partition the graph into multiple segments, this information may be useful, e.g., for putting nodes from the same computation in the same partition. However, you may also compute your own partitioning (e.g., using METIS).
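
For illustration, here is a minimal sketch that loads a layout sample, ranks its configurations, and inspects the keys described above; it is not part of the training pipeline.

import numpy as np

d = dict(np.load("npz/layout/xla/random/train/unet3d.npz"))

# Configuration indices sorted from fastest to slowest.
order = np.argsort(d["config_runtime"])
print("fastest configuration index:", order[0])

# Per-node layout features of the fastest run: one 18-dimensional row
# per configurable node, in the order given by d["node_config_ids"].
print(d["node_config_feat"][order[0]].shape)

# Starting node IDs of the HLO computations (useful for graph partitioning).
print(d["node_splits"])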

Features

Node Features

To extract a node feature vector, we either copy values from various fields of an XLA HLO instruction (a node in an HLO graph) as they are, or convert categorical values using one-hot encoding. To convert an unbounded list of numbers (e.g., a tensor shape) to a fixed-size vector, we truncate the list to six elements and include the summation and/or product of all elements in the list (e.g., the product of dimension sizes represents the volume of the tensor). In our dataset, none of the tensors has more than six dimensions.

The following describes each element at a particular index in the node feature vector.

0: is_root - whether this node is the output
1: element_size_in_bits - deprecated, always 0
// 2–20: One hot vector of shape_element_type.
2: shape_element_type_is_invalid_type
3: shape_element_type_is_pred
4: shape_element_type_is_s8
5: shape_element_type_is_s16
6: shape_element_type_is_s32
7: shape_element_type_is_s64
8: shape_element_type_is_u8
9: shape_element_type_is_u16
10: shape_element_type_is_u32
11: shape_element_type_is_u64
12: shape_element_type_is_f16
13: shape_element_type_is_f32
14: shape_element_type_is_f64
15: shape_element_type_is_bf16
16: shape_element_type_is_c64
17: shape_element_type_is_c128
18: shape_element_type_is_tuple
19: shape_element_type_is_opaque_type
20: shape_element_type_is_token
// 21–28: Size (number of elements) for each dimension, or an upper bound on the size if the dimension is dynamic.  In XLA, dimensions are numbered from 0 to N-1 for an N-dimensional array. The first element of 'shape_dimensions' is the size of dimension 0, the second element is the size of dimension 1, and so forth.  Empty list indicates a scalar.
21: shape_dimensions_0
22: shape_dimensions_1
23: shape_dimensions_2
24: shape_dimensions_3
25: shape_dimensions_4
26: shape_dimensions_5
27: shape_dimensions_sum
28: shape_dimensions_product
29: shape_tuple_shapes_size - for tuples only, the number of constituent shapes in the tuple sequence
30: parameter_number = K - indicating that this is the Kth parameter to the computation; only for the Parameter operation
// 31–36: Dimensions present for some operations that require reshaping or broadcasting, including Reshape, Reduce, ReduceWindow, and Reverse.
31: dimensions_0
32: dimensions_1
33: dimensions_2
34: dimensions_3
35: dimensions_4
36: dimensions_5
// 37–92: Windowing information in an operation such as convolution. The window is moved across a base area and for each position of the window a computation is performed.
37: window_size_0
38: window_size_1
39: window_size_2
40: window_size_3
41: window_size_4
42: window_size_5
43: window_size_sum
44: window_size_product
45: window_stride_0
46: window_stride_1
47: window_stride_2
48: window_stride_3
49: window_stride_4
50: window_stride_5
51: window_stride_sum
52: window_stride_product
53: window_padding_low_0
54: window_padding_low_1
55: window_padding_low_2
56: window_padding_low_3
57: window_padding_low_4
58: window_padding_low_5
59: window_padding_low_sum
60: window_padding_low_product
61: window_padding_high_0
62: window_padding_high_1
63: window_padding_high_2
64: window_padding_high_3
65: window_padding_high_4
66: window_padding_high_5
67: window_padding_high_sum
68: window_padding_high_product
// 69–76: Dilation factor of the sliding window. A dilation factor of 1 means no dilation. window_dilation - 1 no-op entries ("holes") are implicitly placed between each kernel element.
69: window_window_dilation_0
70: window_window_dilation_1
71: window_window_dilation_2
72: window_window_dilation_3
73: window_window_dilation_4
74: window_window_dilation_5
75: window_window_dilation_sum
76: window_window_dilation_product
// 77-84: Dilation factor of the base area. A dilation factor of 1 means no dilation. base_dilation - 1 no-op entries ("holes") are implicitly placed between each base area element.
77: window_base_dilation_0
78: window_base_dilation_1
79: window_base_dilation_2
80: window_base_dilation_3
81: window_base_dilation_4
82: window_base_dilation_5
83: window_base_dilation_sum
84: window_base_dilation_product
// 85-92: Window reversal means that this dimension was logically reversed before the operation.
85: window_window_reversal_0
86: window_window_reversal_1
87: window_window_reversal_2
88: window_window_reversal_3
89: window_window_reversal_4
90: window_window_reversal_5
91: window_window_reversal_true_count
92: window_window_reversal_false_count
// 93–106: The dimension numbers used for a convolution.
93: convolution_dim_numbers_input_batch_dim - the dimension number that represents batch in the input
94: convolution_dim_numbers_input_feature_dim - the dimension number that represents features in the input
// 95–98: Dimension numbers for the spatial dimensions that the window moves through in the input.
95: convolution_dim_numbers_input_spatial_dims_0
96: convolution_dim_numbers_input_spatial_dims_1
97: convolution_dim_numbers_input_spatial_dims_2
98: convolution_dim_numbers_input_spatial_dims_3
99: convolution_dim_numbers_kernel_input_feature_dim - the dimension number that represents input features in the convolutional kernel (rhs)
100: convolution_dim_numbers_kernel_output_feature_dim - the dimension number that represents output features in the convolutional kernel (rhs)
// 101-104: Dimension numbers for the spatial dimensions that the window moves through in the kernel (rhs). window.strides(0) is the stride in the kernel_spatial_dimensions(0) dimension.
101: convolution_dim_numbers_kernel_spatial_dims_0
102: convolution_dim_numbers_kernel_spatial_dims_1
103: convolution_dim_numbers_kernel_spatial_dims_2
104: convolution_dim_numbers_kernel_spatial_dims_3
105: convolution_dim_numbers_output_batch_dim - the dimension number that represents batch in the output
106: convolution_dim_numbers_output_feature_dim - the dimension number that represents features in the output
107: feature_group_count - the number of feature groups, used for a convolution. Must be a divisor of the input feature dimension and output feature dimension. If not specified, it will use a default value of 1.
108: batch_group_count - the number of batch groups, used for a convolution.
// 109–120: [begin/start, end/limit) index range and stride for a slice operation.
109: slice_dims_start_0
110: slice_dims_start_1
111: slice_dims_start_sum
112: slice_dims_start_product
113: slice_dims_stride_0
114: slice_dims_stride_1
115: slice_dims_stride_sum
116: slice_dims_stride_product
117: slice_dims_limit_0
118: slice_dims_limit_1
119: slice_dims_limit_sum
120: slice_dims_limit_product
// 121 - 124: [start, start + size) range size for a dynamic slice ('start' is specified dynamically in the second operand of the operation).
121: dynamic_slice_sizes_0
122: dynamic_slice_sizes_1
123: dynamic_slice_sizes_sum
124: dynamic_slice_sizes_product
// 125–132: Padding configuration that describes the edge padding of a pad operation.
125: padding_config_edge_padding_low_0
126: padding_config_edge_padding_low_1
127: padding_config_edge_padding_low_sum
128: padding_config_edge_padding_low_product
129: padding_config_edge_padding_high_0
130: padding_config_edge_padding_high_1
131: padding_config_edge_padding_high_sum
132: padding_config_edge_padding_high_product
133: is_stable - whether this Sort operation should be stable
// 134–139: Physical layout used to pack the tensor shape.
134: layout_minor_to_major_0
135: layout_minor_to_major_1
136: layout_minor_to_major_2
137: layout_minor_to_major_3
138: layout_minor_to_major_4
139: layout_minor_to_major_5

Suffix _i, where i is an integer, indicates the information for the tensor dimension i. If a tensor has N dimensions, feature values of _i are set to 0 if i >= N (0 padding). Suffix _sum is the summation of the feature values across all dimensions. Suffix _product is the product of the feature values across all dimensions.
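
As a concrete illustration of this convention, here is a small sketch (the function name is ours, not from the extractor) showing how a variable-length dimension list maps to the _0 ... _5, _sum, and _product features.

import math

def encode_dims(dims, max_dims=6):
    # Truncate/zero-pad the per-dimension values to six entries,
    # then append their sum and product.
    padded = list(dims[:max_dims]) + [0] * (max_dims - len(dims))
    return padded + [sum(dims), math.prod(dims) if dims else 0]

# e.g., the shape_dimensions_* features of a tensor with shape (32, 128, 7):
print(encode_dims([32, 128, 7]))
# [32, 128, 7, 0, 0, 0, 167, 28672]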

The source code of the feature extractor can be found in tpu_graphs/process_data/xla/featurizers.h; it extracts features/attributes from the HloProto defined in XLA's hlo.proto.

Tile Config Features

The following describes each element at a particular index in the tile config feature vector.

// 0–7: Tile sizes of the convolution kernel, only for a convolution operation.
0: kernel_bounds_0
1: kernel_bounds_1
2: kernel_bounds_2
3: kernel_bounds_3
4: kernel_bounds_4
5: kernel_bounds_5
6: kernel_bounds_sum
7: kernel_bounds_product
// 8–15: Output tile sizes.
8: output_bounds_0
9: output_bounds_1
10: output_bounds_2
11: output_bounds_3
12: output_bounds_4
13: output_bounds_5
14: output_bounds_sum
15: output_bounds_product
// 16-23: Input tile sizes.
16: input_bounds_0
17: input_bounds_1
18: input_bounds_2
19: input_bounds_3
20: input_bounds_4
21: input_bounds_5
22: input_bounds_sum
23: input_bounds_product

Note that input_bounds are usually set to 0 because they can be inferred by the compiler from output_bounds (and kernel_bounds). If a tensor has N dimensions, feature values of _i are set to 0 if i >= N (0 padding).

Layout Config Features

The following describes each element at a particular index in the per-node layout config feature vector.

// 0–5: Physical layout of the output tensor
0: output_layout_0
1: output_layout_1
2: output_layout_2
3: output_layout_3
4: output_layout_4
5: output_layout_5
// 6-11: Physical layout of the input tensor
6: input_layout_0
7: input_layout_1
8: input_layout_2
9: input_layout_3
10: input_layout_4
11: input_layout_5
// 12-17: Physical layout of the kernel tensor, only for a convolution operation
12: kernel_layout_0
13: kernel_layout_1
14: kernel_layout_2
15: kernel_layout_3
16: kernel_layout_4
17: kernel_layout_5

If a tensor has N dimensions, feature values of _i are set to -1 if i >= N (-1 padding). A layout determines the minor-to-major order of tensor dimensions. For example, the layout {1, 0, 2, -1, -1, -1} of a 3D tensor indicates that dimension 1 is the most minor (elements of the most minor dimension are consecutive in physical memory) and dimension 2 is the most major. We also use the layout {-1, -1, -1, -1, -1, -1} to instruct the compiler to apply its default strategy for selecting the layout of that tensor.
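
To make the minor-to-major convention concrete, here is a small sketch (our own helper, not part of the repo) that derives element strides from a layout vector, dropping the -1 padding.

def strides_from_minor_to_major(shape, layout):
    # layout lists dimension numbers from most minor to most major; -1 is padding.
    dims = [d for d in layout if d >= 0]
    strides = [0] * len(shape)
    step = 1
    for d in dims:  # dims[0] is the most minor (fastest-varying) dimension
        strides[d] = step
        step *= shape[d]
    return strides

# The example above: layout {1, 0, 2, -1, -1, -1} for a 3D tensor of shape (4, 8, 2).
print(strides_from_minor_to_major((4, 8, 2), [1, 0, 2, -1, -1, -1]))
# [8, 1, 32]: dimension 1 is most minor (stride 1), dimension 2 is most major.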

Graph Feature Extraction

This section explains how to customize the node feature extraction. The raw graph data is saved in protobuf format, defined as ModuleTuningData in tpu_graphs/proto/tuning.proto. FeaturizeHloInstruction in tpu_graphs/process_data/xla/featurizers.h contains the main logic to extract node features from a raw graph. To customize node feature extraction, you can modify tpu_graphs/process_data/xla/featurizers.h and the corresponding size of the node feature vector defined in tpu_graphs/process_data/xla/hlo_opcode.h.

Then, build the following target to create a Python module for extracting graph features:

sudo apt install bazel-5.4.1
bazel-5.4.1 build -c opt tpu_graphs/process_data/xla/graph_features.so --experimental_repo_remote_exec
cp bazel-bin/tpu_graphs/process_data/xla/graph_features.so .

The function can be used in Python as follows:

import graph_features
node_opcode, node_feat, edge_index, node_config_ids, node_splits = graph_features.extract_graph_features("<path_to_raw_protobuf_data>.pb")

The directory structure and filenames of .pb files match those of .npz files. In particular, the original graph in npz/.../xxx.npz can be found in pb/.../xxx.pb. Therefore, you can use graph features produced by this function instead of (node_opcode, node_feat, edge_index, node_config_ids, node_splits) attributes from the .npz file.
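
For example, one can sanity-check the extracted features against the corresponding .npz sample; the paths below are illustrative, and we assume the returned objects are NumPy arrays matching the .npz attributes.

import numpy as np
import graph_features

# Illustrative paths; the pb/ and npz/ directory trees mirror each other.
node_opcode, node_feat, edge_index, node_config_ids, node_splits = (
    graph_features.extract_graph_features("pb/layout/xla/random/train/unet3d.pb"))

d = dict(np.load("npz/layout/xla/random/train/unet3d.npz"))
# The extracted arrays can be used in place of the corresponding .npz attributes.
print(node_feat.shape, d["node_feat"].shape)
print(edge_index.shape, d["edge_index"].shape)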

Testing in C++

To test that your code is working properly within C++, you can run:

bazel-5.4.1 build -c opt tpu_graphs/process_data/xla/data_main --experimental_repo_remote_exec --sandbox_debug --verbose_failures
./bazel-bin/tpu_graphs/process_data/xla/data_main <path_to_raw_protobuf_data>.pb

Troubleshooting

If you get libstdc++.so.6: version GLIBCXX_3.4.30 not found during import graph_features, you can follow these steps. First, check if you have /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30. If not, install it. Then, fix your conda environment to use that version.

cp /usr/lib/x86_64-linux-gnu/libstdc++.so.6.0.30 /usr/local/google/home/{user}/miniconda3/envs/tpugraphs/lib/
rm /usr/local/google/home/{user}/miniconda3/envs/tpugraphs/lib/libstdc++.so.6
sudo ln -sf /usr/local/google/home/{user}/miniconda3/envs/tpugraphs/lib/libstdc++.so.6.0.30 /usr/local/google/home/{user}/miniconda3/envs/tpugraphs/lib/libstdc++.so.6
