Dakshina Dataset

The Dakshina dataset is a collection of text in both Latin and native scripts for 12 South Asian languages. For each language, the dataset includes a large collection of native script Wikipedia text, a romanization lexicon which consists of words in the native script with attested romanizations, and some full sentence parallel data in both a native script of the language and the basic Latin alphabet.

Dataset URL: https://github.com/google-research-datasets/dakshina

If you use or discuss this dataset in your work, please cite our paper (bibtex citation below). A PDF link for the paper can be found at https://www.aclweb.org/anthology/2020.lrec-1.294.

@inproceedings{roark-etal-2020-processing,
    title = "Processing {South} {Asian} Languages Written in the {Latin} Script:
    the {Dakshina} Dataset",
    author = "Roark, Brian and
      Wolf-Sonkin, Lawrence and
      Kirov, Christo and
      Mielke, Sabrina J. and
      Johny, Cibu and
      Demir{\c{s}}ahin, I{\c{s}}in and
      Hall, Keith",
    booktitle = "Proceedings of The 12th Language Resources and Evaluation Conference (LREC)",
    year = "2020",
    url = "https://www.aclweb.org/anthology/2020.lrec-1.294",
    pages = "2413--2423"
}

Data links

File                       Download  Version  Date        Notes
dakshina_dataset_v1.0.tar  link      1.0      05/27/2020  Initial data release

Data Organization

There are 12 languages represented in the dataset: Bangla (bn), Gujarati (gu), Hindi (hi), Kannada (kn), Malayalam (ml), Marathi (mr), Punjabi (pa), Sindhi (sd), Sinhala (si), Tamil (ta), Telugu (te) and Urdu (ur).

All data is derived from Wikipedia text. Each language has its own directory, in which there are three subdirectories:

Native Script Wikipedia {#native}

In the native_script_wikipedia subdirectories there are native script text strings from Wikipedia. The scripts are:

  • For bn, gu, kn, ml, si, ta and te, the scripts are named the same as the language,
  • hi and mr are in the Devanagari script,
  • pa is in the Gurmukhi script, and
  • ur and sd are in Perso-Arabic scripts.

All of the scripts other than the Perso-Arabic scripts are Brahmic. This data consists of Wikipedia strings that have been filtered (see below) so that each string is primarily in the Unicode codeblock for the script, plus whitespace and, in some cases, commonly used ASCII punctuation and digits. The pages from which the strings come have been split into training and validation sets, so that no string in the training partition comes from a Wikipedia page from which validation strings were extracted. Files have been gzipped and have accompanying information that permits linking strings back to their original Wikipedia pages. For example, the first line of mr/native_script_wikipedia/mr.wiki-filt.train.text.shuf.txt.gz contains:

कोल्हापुरात मिळणारा तांबडा पांढरा रस्सा कुठेच मिळत नाही.
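
As a minimal sketch of loading this data programmatically (assuming the archive has been unpacked into the current directory; the path follows the naming scheme above):

import gzip

# Print the first filtered Marathi training string (one string per line).
path = "mr/native_script_wikipedia/mr.wiki-filt.train.text.shuf.txt.gz"
with gzip.open(path, "rt", encoding="utf-8") as f:
    print(next(f).rstrip("\n"))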

Lexicons {#lexicons}

In the lexicons subdirectories there are lexicons of words in the native script of each language, alongside human-annotated possible romanizations for each word. The words in the lexicons are all sampled from words that occurred more than once in the Wikipedia training sets in the native_script_wikipedia subdirectories, and most received a romanization from more than one annotator, though the annotators' romanizations may agree. These are in a format similar to pronunciation lexicons, i.e., a single (word, romanization) pair per line in a TSV file, with an additional column indicating the number of attestations for the pair. For example, the first two lines of the file pa/lexicons/pa.translit.sampled.train.tsv contain:

ਅਂਦਾਜਾ	andaaja	1
ਅਂਦਾਜਾ	andaja	2

i.e., two different possible romanizations for the Punjabi word ਅਂਦਾਜਾ, one possible romanization (andaaja) attested once, the other (andaja) twice. For convenience, each lexicon has been partitioned into training, development and testing sets, with partitioning by native script word, so that words in the training set do not occur in the development or testing sets. In addition, we used some automated methods to identify lemmata (see below) in each word, and ensured that lemmata in words in the development and test sets were unobserved in the training set. All native script characters -- specifically, all native script Unicode codepoints -- in the development and test sets are found in the training set. See below for further details on data elicitation and preparation. For each language there are *.train.tsv, *.dev.tsv and *.test.tsv files in the subdirectory. For all languages except for Sindhi (sd), there are 25,000 (native script) word types in the training lexicon, and 2,500 in each of the dev and test lexicons. Sindhi also has 2,500 native script word types in the dev and test lexicons, but just 15,000 in the training lexicon.
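
A minimal Python sketch of loading such a lexicon file (assuming the dataset is unpacked locally and the three-column layout shown above):

from collections import defaultdict

# Map each native script word to its attested romanizations and counts,
# e.g. lexicon['ਅਂਦਾਜਾ'] == {'andaaja': 1, 'andaja': 2}.
lexicon = defaultdict(dict)
with open("pa/lexicons/pa.translit.sampled.train.tsv", encoding="utf-8") as f:
    for line in f:
        word, romanization, count = line.rstrip("\n").split("\t")
        lexicon[word][romanization] = int(count)

print(len(lexicon))  # number of native script word types (25,000 for Punjabi)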

Romanized {#romanized}

In the romanized subdirectories, we have manually romanized full strings, alongside the original native script prompts for the examples. The native script prompts were selected from the validation sets in the native_script_wikipedia subdirectories (see the description of preprocessing below); 10,000 strings from each native script validation set were randomly chosen to be romanized by native speaker annotators. For ease of annotation, long sentences (more than 30 words) were segmented into shorter fragments (by splitting in half until fragments were under 30 words), and each fragment was romanized independently. From this process, there are *.split.tsv and *.rejoined.tsv files, which contain the native script and romanized strings in two tab-delimited fields. (Files with 'split' are the versions in which long strings have been segmented; those with 'rejoined' are not length segmented.) For example, the first line of hi/romanized/hi.romanized.rejoined.tsv contains:

जबकि यह जैनों से कम है।	Jabki yah Jainon se km hai.

Additionally, for convenience, we performed an automatic (white space) token-level alignment of the strings, with one aligned token per line, as well as an end-of-string marker </s>. In cases where the tokenization is not one-to-one, multiple tokens are left on the same line. These alignments are also provided with the Latin script de-cased and punctuation removed; e.g., the first seven lines of the file hi/romanized/hi.romanized.rejoined.aligned.cased_nopunct.tsv are:

जबकि	jabki
यह	yah
जैनों	jainon
से	se
कम	km
है	hai
</s>	</s>
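
A small Python sketch (assuming the two-column layout shown above) that regroups the token-level alignments back into sentence pairs using the </s> marker:

def read_aligned(path):
    """Yield (native_tokens, latin_tokens) pairs, one per sentence."""
    native, latin = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            src, tgt = line.rstrip("\n").split("\t")
            if src == "</s>":          # end-of-string marker closes a sentence
                yield native, latin
                native, latin = [], []
            else:
                native.append(src)     # may hold several tokens if not one-to-one
                latin.append(tgt)

pairs = list(read_aligned("hi/romanized/hi.romanized.rejoined.aligned.cased_nopunct.tsv"))
print(pairs[0])  # first (native_tokens, latin_tokens) pair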

We also validated the romanizations by requesting that different annotators transcribe the romanized strings back into the native script of each language (see details below). The resulting native script transcriptions are provided (*.split.validation.native.txt) for each language, along with a file (*.split.validation.edits.txt) that provides counts of (1) the total number of reference characters (in the original native-script strings), (2) substitutions, (3) deletions and (4) insertions in the validation transcriptions. For example, the first two lines of the file bn/romanized/bn.romanized.split.validation.edits.txt are:

 LINE REF SUB DEL INS
    1 126   3   3   0

which indicates that the first native script string in bn/romanized/bn.romanized.split.tsv has 126 characters, and that there were 3 substitutions, 3 deletions and 0 insertions in the native script string transcribed by annotators during the validation phase. Note that the comparison involved some script normalization of visually identical sequences to minimize spurious errors, as described in more detail below. All languages fell between 3.5 and 8.5 percent character error rate on the validation text. See below for further details on this validation process.
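
Character error rate (CER) per line can be derived from these counts in the usual way, i.e., (SUB + DEL + INS) / REF; for the first Bangla line above that is (3 + 3 + 0) / 126, or roughly 4.8%. A minimal sketch:

def cer(ref, sub, dele, ins):
    # Standard character error rate: total edits divided by reference length.
    return (sub + dele + ins) / ref

with open("bn/romanized/bn.romanized.split.validation.edits.txt", encoding="utf-8") as f:
    next(f)  # skip the "LINE REF SUB DEL INS" header
    for line in f:
        idx, ref, sub, dele, ins = map(int, line.split())
        print(idx, f"{cer(ref, sub, dele, ins):.3f}")  # first line: 1 0.048
        break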

Finally, for convenience, we randomly shuffled this set and divided it into development and test sets, each of which is broken into native and Latin script text files. Thus the first line of the file si/romanized/si.romanized.rejoined.dev.native.txt is:

වැව්වල ඇළෙවිලි වැව ඉහත්තාව, වේල්ල ආරක්ෂා කිරිමට එකල සියල්ලෝම බැදි සිටියෝය.

and the first line of si/romanized/si.romanized.rejoined.dev.roman.txt is:

vevvala eleveli, veva ihatthava, vella araksha kirimata ekala siyalloma bendi sitiyaya.

Note that several hundred strings from the Urdu Wikipedia sample (and one from Sindhi) were not in those languages, but rather in other languages that use a Perso-Arabic script, e.g., Arabic, Punjabi or others. Those strings were excluded from those sets, leaving fewer than 10,000 romanized strings.

Native script data preprocessing {#native-preprocessing}

Let $L be the language code, one of bn, gu, hi, kn, ml, mr, pa, sd, si, ta, te, or ur. The native script files are in $L/native_script_wikipedia. All URLs of Wikipedia pages are included in $L.wiki-full.urls.tsv.gz. This tab delimited file includes four fields: page ID, revision ID, base URL, and URL with revision ID.
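
A minimal sketch of reading this URL file (the path follows the directory layout above, using Hindi as an example):

import gzip

# Each line: page ID, revision ID, base URL, URL pinned to that revision.
path = "hi/native_script_wikipedia/hi.wiki-full.urls.tsv.gz"
with gzip.open(path, "rt", encoding="utf-8") as f:
    page_id, revision_id, base_url, revision_url = next(f).rstrip("\n").split("\t")
    print(page_id, base_url)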

We omitted whole pages that were any of the following:

  1. redirected pages.
  2. pages with infoboxes about settlements or jurisdictions.
  3. pages with state=collapsed or expanded or autocollapse.
  4. pages referring to censusindia or en.wikipedia.org.
  5. pages with wikitable.
  6. pages with lists containing more than 7 items.

Indices of pages omitted are given in $L.wiki-full.omit_pages.txt.gz.

For pages that are not omitted, we extract text and info files:

  • $L.wiki-full.text.sorted.tsv.gz
  • $L.wiki-full.info.sorted.tsv.gz

Text is organized by page and section within page. We then:

  1. split section text by newline (leading to multiple strings per section).
  2. NFC normalize.
  3. sentence segment using ICU sentence segmentation (leading to multiple sentences per string). The ICU sentence segmenter is initialized with the locale associated with the specific language being segmented (see the sketch below).
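
A minimal sketch of steps 2 and 3 in Python, using unicodedata and PyICU (an assumption; the release does not specify the exact tooling used):

import unicodedata
from icu import BreakIterator, Locale

def split_sentences(text, lang_code):
    # Step 2: NFC normalization.
    text = unicodedata.normalize("NFC", text)
    # Step 3: ICU sentence segmentation with the language-specific locale.
    breaker = BreakIterator.createSentenceInstance(Locale(lang_code))
    breaker.setText(text)
    start = breaker.first()
    for end in breaker:          # iterate over sentence boundaries
        yield text[start:end].strip()
        start = end

print(list(split_sentences("यह पहला वाक्य है। यह दूसरा है।", "hi")))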

Both tab delimited files share the same initial 6 fields: page_id, section_index, string_index (in section), sentence_index (in string), include_bool, and text_freq, where include_bool indicates whether to include the string (see below), and text_freq is the count of the full section text string in the whole collection. The latter lets us find repeated strings, which is how we identify boilerplate sections and other content to exclude.

Both files are sorted numerically (descending) by the first three fields.

The final (7th) field of $L.wiki-full.text.sorted.tsv.gz is the text.

The remaining fields of $L.wiki-full.info.sorted.tsv.gz are: (7) depth of the section in the page; (8) the heading level of the section; (9) the section index of the parent section; (10) the number of words in the text; (11) the number of Unicode codepoints in the text; (12) the percentage of Unicode codepoints falling in category A (described below); (13) the percentage of Unicode codepoints falling in category B (described below); and (14) the section title.
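
As an illustration, the fields of the info file can be mapped onto names as follows (a sketch; the names are hypothetical, but the field order follows the description above and exactly 14 tab-separated fields per line are assumed):

import gzip
from collections import namedtuple

# Fields of $L.wiki-full.info.sorted.tsv.gz, in the order described above.
InfoRow = namedtuple("InfoRow", [
    "page_id", "section_index", "string_index", "sentence_index",
    "include_bool", "text_freq", "section_depth", "heading_level",
    "parent_section_index", "num_words", "num_codepoints",
    "pct_category_a", "pct_category_b", "section_title",
])

with gzip.open("hi/native_script_wikipedia/hi.wiki-full.info.sorted.tsv.gz",
               "rt", encoding="utf-8") as f:
    row = InfoRow(*next(f).rstrip("\n").split("\t"))
    print(row.page_id, row.include_bool, row.section_title)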

For a given native script Unicode block, we define categories A and B as follows. First, we identify a subset of codepoints as special symbols, call them non-letter symbols: non-letter ASCII codepoints; Arabic full stop; Devanagari Danda; any codepoint in the General Punctuation block; and any digits in the current native script Unicode block. Category A are those codepoints that (1) are outside of the native script Unicode block; and (2) are not in the non-letter subset of codepoints. Category B are all codepoints within the native script Unicode block.

The above-mentioned include_bool is set to true (when filtering) if: the percentage of category A is below a threshold; the percentage of category B is above a second threshold; and, finally, the percentage of whitespace-delimited words that contain at least one codepoint from the current native script Unicode block (and not in the non-letter subset) is above the same threshold as category B codepoints.
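
The following Python sketch illustrates the categorization and the filtering decision. It is an approximation: the thresholds are the ones given below for the second extraction pass (at most 10% category A, at least 85% category B and in-script words), the non-letter set follows the description above, and since the README does not specify how whitespace enters the percentages, it is simply excluded from the denominators here.

# General Punctuation block and the two script-specific stops named above.
GENERAL_PUNCTUATION = range(0x2000, 0x2070)
ARABIC_FULL_STOP, DEVANAGARI_DANDA = 0x06D4, 0x0964

def is_non_letter(cp, block_digits):
    ch = chr(cp)
    return ((cp < 128 and not ch.isalpha())          # non-letter ASCII
            or cp in (ARABIC_FULL_STOP, DEVANAGARI_DANDA)
            or cp in GENERAL_PUNCTUATION
            or cp in block_digits)                    # digits of the native block

def include_string(text, block, block_digits, a_max=0.10, b_min=0.85):
    # Fractions are computed over non-whitespace codepoints in this sketch.
    cps = [ord(c) for c in text if not c.isspace()]
    words = text.split()
    if not cps or not words:
        return False
    cat_a = sum(1 for cp in cps
                if cp not in block and not is_non_letter(cp, block_digits))
    cat_b = sum(1 for cp in cps if cp in block)
    in_script = sum(1 for w in words
                    if any(ord(c) in block and not is_non_letter(ord(c), block_digits)
                           for c in w))
    return (cat_a / len(cps) <= a_max
            and cat_b / len(cps) >= b_min
            and in_script / len(words) >= b_min)

# Example with the Devanagari block (U+0900..U+097F) and its digits.
DEVANAGARI = set(range(0x0900, 0x0980))
DEVANAGARI_DIGITS = set(range(0x0966, 0x0970))
print(include_string("जबकि यह जैनों से कम है।", DEVANAGARI, DEVANAGARI_DIGITS))  # True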

For each non-empty section title, we calculate the total number of Unicode codepoints, the total number of category A codepoints, and the fraction of codepoints that are category A, for all sections with that title. These statistics are stored in $L.wiki-full.nonblock.sections.tsv.gz, which is sorted in descending order by total category A codepoints. Thus, the first line of hi.wiki-full.nonblock.sections.tsv.gz shows the section title with the most category A characters:

$ gzip -cd hi.wiki-full.nonblock.sections.tsv.gz | head -1
6387096.000260	12141241	0.526066	सन्दर्भ

It is unsurprising that a section titled सन्दर्भ ('references') would have so much non-codeblock text (mainly ASCII). It also illustrates why we track this statistic, since we do not want to include references in the text that we are extracting. To avoid such sections, we create a list of section titles for which the aggregate percentage of category A codepoints in sections with that title is greater than 20%. These omitted section titles are in $L.wiki-full.nonblock.sections.list.txt.gz.
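
A sketch of recreating that list from the per-title statistics file (the column order of category A count, total count, fraction, title is inferred from the example line above):

import gzip

# Section titles whose aggregate fraction of category A codepoints exceeds 20%.
omit_titles = []
path = "hi/native_script_wikipedia/hi.wiki-full.nonblock.sections.tsv.gz"
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        cat_a, total, fraction, title = line.rstrip("\n").split("\t")
        if float(fraction) > 0.20:
            omit_titles.append(title)
print(len(omit_titles), omit_titles[:3])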

A second round of text extraction then occurs, omitting text occurring in the aforementioned sections and including only individual strings that consist of at least 85% category B codepoints, at most 10% category A codepoints, and at least 85% of white-space delimited words containing a within-codeblock (and not non-letter) codepoint.

All text that is extracted from a given Wikipedia page is collectively placed in either a training or a validation set, i.e., no strings in the validation set share a Wikipedia page with any string in the training set. Between 23 and 29 thousand strings are placed in each validation set, which represents a minimum of 2.25% of the data and a maximum of 26% of the data.

The data from this second iteration of extraction is present in:

  • $L.wiki-filt.train.info.sorted.tsv.gz
  • $L.wiki-filt.train.text.sorted.tsv.gz
  • $L.wiki-filt.train.text.shuf.txt.gz
  • $L.wiki-filt.valid.info.sorted.tsv.gz
  • $L.wiki-filt.valid.text.sorted.tsv.gz
  • $L.wiki-filt.valid.text.shuf.txt.gz

The first three files are training set files; the final three are validation set files. The info.sorted and text.sorted files contain index-sorted data along the lines described above for the full set, for both the training and validation sets. We additionally provide randomly shuffled text from both sets in the text.shuf files.

Annotation

Validation string selection criteria for romanization

We randomly selected 10,000 strings from each validation set detailed above for romanization by annotators. As detailed earlier, long strings were segmented into shorter strings for ease of annotation.

Round-trip romanization validation {#round-trip-validation}

After eliciting manual romanizations for each of the 10,000 strings in the selected validation sets in each language, we validated the resulting romanizations via a second round of annotation: different annotators were given the romanized strings and tasked with producing the strings in the native script, and their transcriptions were then compared with the original strings. To compare the strings, we performed a visual normalization of both the original and the validation native script strings, so that visually identical strings were encoded with the same codepoints for comparison. We then calculated the number of character substitutions, deletions and insertions for each string in the Viterbi alignment between the visually normalized original and validation strings, including whitespace and punctuation. This, along with the count of characters in the reference (the visually normalized original string), allows character error rates to be calculated.
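
The per-string edit counts can be reproduced with a standard character-level Levenshtein alignment. A minimal sketch (not the normalization or alignment code used for the release; ties between equal-cost alignments may resolve differently here):

def edit_counts(ref, hyp):
    # dp[i][j] = (total_edits, subs, dels, ins) for ref[:i] vs hyp[:j].
    dp = [[(j, 0, 0, j) for j in range(len(hyp) + 1)]]
    for i in range(1, len(ref) + 1):
        row = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            ts, ss, ds, ns = dp[i - 1][j - 1]          # substitution / match
            td, sd, dd, nd = dp[i - 1][j]              # deletion from ref
            ti, si, di, ni = row[j - 1]                # insertion into ref
            row.append(min((ts + cost, ss + cost, ds, ns),
                           (td + 1, sd, dd + 1, nd),
                           (ti + 1, si, di, ni + 1)))
        dp.append(row)
    _, subs, dels, ins = dp[-1][-1]
    return subs, dels, ins

print(edit_counts("abcd", "axd"))  # (1, 1, 0): one substitution, one deletion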

As stated above, the languages all fell between 3.5 and 8.5 percent character error rate. The error rate could not have been 0 for a variety of reasons:

  • Some non-codeblock text was allowed in the original native script strings, e.g., individual words in the Latin script, something that annotators with access only to the romanizations could not recover;
  • Digit strings are variously realized with Latin or native script digits, which is also not recoverable;
  • In the Perso-Arabic script in particular, tokenization in native and Latin scripts may be different, leading to whitespace character mismatch; similarly, punctuation placement sometimes leads to different tokenization;
  • Errors occurring in either the original Wikipedia or validation strings;
  • Visually identical strings can be encoded with different Unicode codepoint sequences, something we controlled to some extent with visual normalization, but other correspondences may occur; and
  • Valid spelling variation exists in the languages, e.g., for English loanwords, but also for common words such as "Hindi" in Devanagari, which can be equally well realized as either हिन्दी or हिंदी.

We provide the validation strings and character edits per string to permit users of the resource to potentially explore methods that take such information into account, e.g., for model evaluation.

License

The dataset is licensed under CC BY-SA 4.0. Any third party content or data is provided "As Is" without any warranty, express or implied.

Contacts

  • roark [at] google.com
  • ckirov [at] google.com
  • wolfsonkin [at] google.com
