ToTTo Dataset
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples that proposes a controlled generation task: given a Wikipedia table and a set of highlighted table cells, produce a one-sentence description.
During the dataset creation process, tables from English Wikipedia are matched with (noisy) descriptions. Each table cell mentioned in the description is highlighted and the descriptions are iteratively cleaned and corrected to faithfully reflect the content of the highlighted cells.
We hope this dataset can serve as a useful research benchmark for high-precision conditional text generation.
You can find more details, analyses, and baseline results in our paper. You can cite it as follows:
@inproceedings{parikh2020totto,
title={{ToTTo}: A Controlled Table-To-Text Generation Dataset},
author={Parikh, Ankur P and Wang, Xuezhi and Gehrmann, Sebastian and Faruqui, Manaal and Dhingra, Bhuwan and Yang, Diyi and Das, Dipanjan},
booktitle={Proceedings of EMNLP},
year={2020}
}
Getting Started
Download the ToTTo data
The ToTTo dataset is released under the Creative Commons Share-Alike 3.0 license.
To download the data from the command line:
wget https://storage.googleapis.com/totto-public/totto_data.zip
unzip totto_data.zip
Alternatively, copy the URL above into your browser's address bar.
Inside the totto_data directory you should see three files: totto_train_data.jsonl, totto_dev_data.jsonl, and unlabeled_totto_test_data.jsonl, for the training, development, and unlabeled test sets respectively.
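As a minimal sketch of how these files could be read, the snippet below parses a .jsonl file one line at a time. The helper name read_jsonl is our own illustration and is not part of the released scripts.

```python
import json
from typing import Dict, Iterator


def read_jsonl(path: str) -> Iterator[Dict]:
    """Yield one example (a dict) per non-empty line of a .jsonl file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines, if any
                yield json.loads(line)


# Example usage (assumes the archive was unzipped in the current directory):
# for example in read_jsonl("totto_data/totto_dev_data.jsonl"):
#     print(example["table_page_title"])
```

Reading lazily keeps memory use low, which may matter for the training file given its size.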
Download the evaluation scripts
You can find evaluation scripts and some exploratory processing scripts at this repository. It also includes a separate README file with instructions on how to run the evaluation.
Dataset Description
The ToTTo dataset consists of three .jsonl files, where each line is a JSON dictionary with the following format:
{
"table_page_title": "'Weird Al' Yankovic",
"table_webpage_url": "https://en.wikipedia.org/wiki/%22Weird_Al%22_Yankovic",
"table_section_title": "Television",
"table_section_text": "",
"table": "[Described below]",
"highlighted_cells": [[22, 2], [22, 3], [22, 0], [22, 1], [23, 3], [23, 1], [23, 0]],
"example_id": 12345678912345678912,
"sentence_annotations": [{"original_sentence": "In 2016, Al appeared in 2 episodes of BoJack Horseman as Mr. Peanutbutter's brother, Captain Peanutbutter, and was hired to voice the lead role in the 2016 Disney XD series Milo Murphy's Law.",
"sentence_after_deletion": "In 2016, Al appeared in 2 episodes of BoJack Horseman as Captain Peanutbutter, and was hired to the lead role in the 2016 series Milo Murphy's Law.",
"sentence_after_ambiguity": "In 2016, Al appeared in 2 episodes of BoJack Horseman as Captain Peanutbutter, and was hired for the lead role in the 2016 series Milo Murphy's 'Law.",
"final_sentence": "In 2016, Al appeared in 2 episodes of BoJack Horseman as Captain Peanutbutter and was hired for the lead role in the 2016 series Milo Murphy's Law."}],
}
The table field is a List[List[Dict]]. The outer list represents the rows and each inner list holds the cells (columns) of one row. Each Dict has the fields column_span: int, is_header: bool, row_span: int, and value: str. The first two rows for the example above look as follows:
[
[
{ "column_span": 1,
"is_header": true,
"row_span": 1,
"value": "Year"},
{ "column_span": 1,
"is_header": true,
"row_span": 1,
"value": "Title"},
{ "column_span": 1,
"is_header": true,
"row_span": 1,
"value": "Role"},
{ "column_span": 1,
"is_header": true,
"row_span": 1,
"value": "Notes"}
],
[
{ "column_span": 1,
"is_header": false,
"row_span": 1,
"value": "1997"},
{ "column_span": 1,
"is_header": false,
"row_span": 1,
"value": "Eek! The Cat"},
{ "column_span": 1,
"is_header": false,
"row_span": 1,
"value": "Himself"},
{ "column_span": 1,
"is_header": false,
"row_span": 1,
"value": "Episode: 'The FugEektive'"}
], ...
]
- The table metadata consists of the table_page_title, table_section_title, and table_section_text strings to help give the model more context about the table.
- The highlighted_cells field is a List[[row_index, column_index]] where each [row_index, column_index] pair indicates that table[row_index][column_index] is highlighted.
- The example_id is simply a unique id for this example.
- The sentence_annotations field consists of the original_sentence and the sequence of revised sentences (sentence_after_deletion, then sentence_after_ambiguity) produced on the way to the final_sentence. See our paper for more details.
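The indexing convention for highlighted_cells can be sketched as follows. The helper names and the toy table are hypothetical and only mirror the documented field layout; they are not taken from the dataset or the official scripts.

```python
from typing import Dict, List


def highlighted_values(table: List[List[Dict]],
                       highlighted_cells: List[List[int]]) -> List[str]:
    """Return the `value` of each highlighted cell, in the order given."""
    return [table[row][col]["value"] for row, col in highlighted_cells]


def cell(value: str, is_header: bool = False) -> Dict:
    """Build a cell dict with the four documented fields (toy helper)."""
    return {"column_span": 1, "is_header": is_header, "row_span": 1, "value": value}


# Toy table in the documented format (not a real ToTTo example).
toy_table = [
    [cell("Year", True), cell("Title", True)],
    [cell("1997"), cell("Eek! The Cat")],
]

print(highlighted_values(toy_table, [[1, 1], [1, 0]]))
# prints ['Eek! The Cat', '1997']
```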
To help understand the dataset, you can find a sample of the train and dev sets in the sample/ folder of our supplementary repository. It additionally provides the create_table_to_text_html.py script that visualizes examples, the output of which you can also find in the sample/ folder.
Official Task
The official task described in our paper is: given the table, highlighted_cells, and table metadata (table_page_title, table_section_title, and table_section_text) as input, generate the final_sentence.
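To make the input/output pairing concrete, here is a rough sketch that turns one training example into a (source, target) string pair. The flattened text format and the function name build_input_and_target are assumptions made for illustration; this is not the linearization used in the paper, and it simplifies the task by keeping only the highlighted cell values rather than the full table.

```python
from typing import Dict, Tuple


def build_input_and_target(example: Dict) -> Tuple[str, str]:
    """Flatten metadata and highlighted cell values into a source string.

    The " | "-separated layout is a hypothetical format chosen for this sketch.
    """
    cells = [example["table"][r][c]["value"]
             for r, c in example["highlighted_cells"]]
    source = " | ".join([
        "page: " + example["table_page_title"],
        "section: " + example["table_section_title"],
        "text: " + example["table_section_text"],
        "cells: " + "; ".join(cells),
    ])
    # Uses the first annotation's final_sentence as the generation target.
    target = example["sentence_annotations"][0]["final_sentence"]
    return source, target
```

A real model would likely also want the headers associated with each highlighted cell, which this sketch leaves out.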
Dev and Test Set
The dev and test sets have two or three references for each example, which are stored in the list under the sentence_annotations key. The test set annotations are private and thus not included in the data.
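Collecting the references for an example might look like the sketch below. The function name get_references is hypothetical, and returning an empty list when sentence_annotations is absent is our assumption about how the unlabeled test file should be handled.

```python
from typing import Dict, List


def get_references(example: Dict) -> List[str]:
    """Return every final_sentence attached to an example.

    Falls back to an empty list when no annotations are present
    (assumed to be the case for the unlabeled test set).
    """
    annotations = example.get("sentence_annotations", [])
    return [ann["final_sentence"] for ann in annotations]
```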
If you want us to evaluate your model on the development or the private test set, please submit your files here. You can find more submission information below. By emailing us or by submitting prediction files, you consent to being contacted by Google about your submission, this dataset or any related competitions.
We provide two splits within the dev and test sets: one uses previously seen combinations of table headers and one uses unseen combinations. The subsets are marked using the overlap_subset: bool flag that is added to the JSON representation. By filtering the evaluation to examples with the flag set to false (the unseen combinations), you will be able to test the generalization ability of your model.
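A small sketch of partitioning examples on this flag is shown below. The function name split_by_overlap is our own, and the code assumes every dev example carries the overlap_subset key as described above.

```python
from typing import Dict, List, Tuple


def split_by_overlap(examples: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Partition examples into (overlap, non_overlap) by the overlap_subset flag."""
    overlap, non_overlap = [], []
    for example in examples:
        if example["overlap_subset"]:
            overlap.append(example)
        else:
            non_overlap.append(example)
    return overlap, non_overlap
```

Evaluating each list separately gives per-subset scores comparable in spirit to the subset columns of the leaderboard.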
Leaderboard
We are maintaining a leaderboard with official results on our test set.
The leaderboard indicates whether or not a model was trained on any auxiliary Wikipedia data. This is because our tables and (unrevised) test targets are from Wikipedia and thus we would like to study the effect of using additional Wikipedia data to train models.
We ask you to not incorporate any part of the ToTTo development set into the training data, and only use it for validation/hyperparameter tuning as development sets are typically used.
In addition to BLEU and PARENT, we also report a learned metric, BLEURT. The checkpoint used was BLEURT-base-128, which can be found here. To handle multiple references, we take the average of the scores, as suggested by Sellam et al. 2020.
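The multi-reference averaging can be illustrated with the sketch below. It does not call BLEURT itself: score_fn is a placeholder for any function that scores one (reference, prediction) pair, and both names are hypothetical.

```python
from typing import Callable, List


def average_over_references(score_fn: Callable[[str, str], float],
                            prediction: str,
                            references: List[str]) -> float:
    """Score the prediction against each reference and return the mean."""
    if not references:
        raise ValueError("at least one reference is required")
    scores = [score_fn(reference, prediction) for reference in references]
    return sum(scores) / len(scores)
```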
Model | Link | Uses Wiki | Overall BLEU | Overall PARENT | Overall BLEURT | Overlap Subset BLEU | Overlap Subset PARENT | Overlap Subset BLEURT | Non-Overlap Subset BLEU | Non-Overlap Subset PARENT | Non-Overlap Subset BLEURT |
---|---|---|---|---|---|---|---|---|---|---|---|
LATTICE | [Wang et al. 2022] | yes | 48.4 | 58.1 | 0.222 | 56.1 | 62.4 | 0.345 | 40.4 | 53.9 | 0.099 |
SKY | in preparation | yes | 49.9 | 59.8 | 0.212 | 57.8 | 64.0 | 0.334 | 42.0 | 55.7 | 0.091 |
CoNT | [An et al., 2022] | yes | 49.1 | 58.9 | 0.238 | 56.7 | 63.2 | 0.355 | 41.3 | 54.6 | 0.121 |
Supervised+NLPO | [Ramamurthy et al. 2022] | yes | 47.4 | 59.6 | 0.192 | 55.0 | 64.3 | 0.315 | 39.2 | 55.0 | 0.068 |
Anonymous 3 | in preparation | yes | 49.3 | 58.8 | 0.235 | 57.1 | 63.4 | 0.358 | 41.5 | 54.1 | 0.112 |
ProEdit | Paper in preparation | yes | 48.6 | 59.18 | 0.202 | 55.9 | 63.3 | 0.325 | 41.3 | 55.1 | 0.078 |
Anonymous 2 | Paper in preparation | yes | 49.4 | 59.0 | 0.253 | 57.0 | 62.9 | 0.370 | 41.7 | 55.1 | 0.136 |
PlanGen (University of Cambridge, Apple) | [Su et al. 2021] | yes | 49.2 | 58.7 | 0.249 | 56.9 | 62.8 | 0.371 | 41.5 | 54.6 | 0.126 |
T5-based (Google) | [Kale, 2020] | yes | 49.5 | 58.4 | 0.230 | 57.5 | 62.6 | 0.351 | 41.4 | 54.2 | 0.1079 |
BERT-to-BERT (Wiki+Books) | [Rothe et al., 2019] | yes | 44.0 | 52.6 | 0.121 | 52.7 | 58.4 | 0.259 | 35.1 | 46.8 | -0.017 |
BERT-to-BERT (Books) | [Rothe et al., 2019] | no | 43.9 | 52.6 | 0.104 | 52.7 | 58.4 | 0.255 | 34.8 | 46.7 | -0.046 |
Pointer Generator | [See et al., 2017] | no | 41.6 | 51.6 | 0.076 | 50.6 | 58.0 | 0.244 | 32.2 | 45.2 | -0.0922 |
Content Planner | [Puduppully et al., 2019] | no | 19.2 | 29.2 | -0.576 | 24.5 | 32.5 | -0.491 | 13.9 | 25.8 | -0.662 |
Leaderboard Submission
If you want to submit dev and test outputs, please format your predictions as a single .txt file with one prediction per line. The predictions should be in the same order as the examples in the test.jsonl file.
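One way to produce such a file is sketched below. The function name write_predictions is hypothetical; collapsing internal whitespace is our own precaution so that a stray newline inside a prediction cannot shift the line-to-example alignment.

```python
from typing import Iterable


def write_predictions(predictions: Iterable[str], path: str) -> None:
    """Write one prediction per line, preserving the input order."""
    with open(path, "w", encoding="utf-8") as f:
        for prediction in predictions:
            # Collapse all whitespace runs (including newlines) to single spaces.
            f.write(" ".join(prediction.split()) + "\n")
```

The predictions passed in should already follow the order of the examples in the corresponding .jsonl file.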
You can upload your prediction files here and email us at [email protected] to tell us you have submitted. By emailing us or by submitting prediction files, you consent to being contacted by Google about your submission, this dataset or any related competitions.