The Tatoeba Translation Challenge (v2021-08-07)

This is a challenge set for machine translation that contains 29G translation units in 3,708 bitexts covering 557 languages. The package includes a release of 631 test sets derived from Tatoeba.org that cover 134 languages.

The latest release also includes some parallel data sets within the same language in order to test paraphrase models. Note, however, that support for paraphrasing is very limited in our data sets.

In more detail

This package provides data sets for machine translation in many languages with test data taken from Tatoeba.

The Tatoeba translation challenge includes shuffled training data taken from OPUS and test data from Tatoeba via the aligned data set in OPUS. All data sets are normalised to ISO-639-3 language codes (as far as possible), using macro-languages where several individual sub-languages are available. Naturally, the training data do not include Tatoeba sentences, and the popular WMT test sets are not included either, to allow a fair comparison with other models that use those data sets.

This is an open challenge and the idea is to encourage people to develop machine translation in real-world cases for many languages. The most important point is to get away from artificial settings that simulate low-resource scenarios or zero-shot translations. Instead, we extracted data sets with all the data we have in a large collection of parallel corpora and do not reduce high-resource scenarios in an unnatural way. Tatoeba is, admittedly, a rather easy test set in general, but it includes a wide variety of languages and makes it easy to get started with rather encouraging results even for lesser-resourced languages. The release also includes medium- and high-resource settings and allows a wide range of experiments with all supported language pairs, including studies of transfer learning and pivot-based methods.

Please cite the following paper if you use data and models from this distribution:

@inproceedings{tiedemann-2020-tatoeba,
    title = "The {T}atoeba {T}ranslation {C}hallenge {--} {R}ealistic Data Sets for Low Resource and Multilingual {MT}",
    author = {Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Fifth Conference on Machine Translation",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.wmt-1.139",
    pages = "1174--1182"
}

Data releases

The current release includes data for 3,708 language pairs covering 557 languages. The data sets are released per language pair with the following structure (using deu-eng as an example):

data/deu-eng/
data/deu-eng/train.src.gz
data/deu-eng/train.trg.gz
data/deu-eng/train.id.gz
data/deu-eng/dev.id
data/deu-eng/dev.src
data/deu-eng/dev.trg
data/deu-eng/test.src
data/deu-eng/test.trg
data/deu-eng/test.id

Files with the extension .src contain sentences in the source language (deu in this case) and files with the extension .trg contain sentences in the target language (eng here). Files with the extension .id contain the ISO-639-3 language labels, possibly with extensions for the orthographic script and information about regional variants. The .id file for the training data also contains labels for the OPUS corpus the sentences come from.

Other notes about the compilation of the data sets can be found in Development.md and the complete list of language pairs is in data/README.md.

New releases are planned in the future and will be announced here. Development and test data will be updated regularly but the original test sets will stay in the release. Updates of the test data will be available through this devtest release and will not include any examples available in development data. Those data sets are also available from this git repository in the subdirectory data/devtest/.
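As a minimal sketch of how the three training files fit together, the following Python snippet reads them line by line in parallel. It assumes that the files are line-aligned and that the fields in the .id file are tab-separated; the function name read_training_triples and the limit parameter are hypothetical and only used for illustration.

```python
import gzip
from itertools import islice
from pathlib import Path


def read_training_triples(pair_dir, limit=None):
    """Yield (id_fields, source, target) for aligned lines of the training files.

    Assumption: train.id.gz, train.src.gz and train.trg.gz have the same
    number of lines and the .id fields are separated by tabs.
    """
    pair_dir = Path(pair_dir)
    names = ("train.id.gz", "train.src.gz", "train.trg.gz")
    handles = [gzip.open(pair_dir / name, "rt", encoding="utf-8") for name in names]
    try:
        # zip stops at the shortest file; islice(..., None) reads everything.
        for ids, src, trg in islice(zip(*handles), limit):
            yield ids.rstrip("\n").split("\t"), src.rstrip("\n"), trg.rstrip("\n")
    finally:
        for handle in handles:
            handle.close()
```

For example, list(read_training_triples("data/deu-eng", limit=5)) would return the first five training examples of the deu-eng pair together with their labels.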

The translation challenge

The main challenge is to develop translation models and to test them with the given test data from Tatoeba. The focus is on low-resource languages, with the aim of pushing their coverage and translation quality. Resources for high-resource languages are also provided and can be used for translation modelling of those languages and for knowledge transfer to less-resourced languages. Note that not all language pairs have sufficient data for test, development (dev) and training (train) sets. Hence, we divided the Tatoeba challenge data into various subsets:

For all those selected language pairs, the data set provides at least 200 sentences per test set. Note that everything below 1,000 sentences is probably not very reliable as a proper test set but, here we go, what can we do for real-world cases of low-resource languages?

The data challenge

The most important ingredient for improved translation quality is data. This concerns not only training data but also, very much, appropriate test data that can help to push the development of transfer models and other ideas for handling low-resource settings. Therefore, another challenge we want to open here is to increase the coverage of test sets for low-resource languages. This challenge is really important and contributions are necessary. The approach here is to contribute translations for your favorite language directly to the Tatoeba data collection. The new translations will make their way into the data set here through OPUS! Make an effort and provide new data already today!

We also encourage incorporating other test sets besides the Tatoeba data. Raise an issue in the issue tracker if you want to propose or provide additional test data. This is especially interesting for less common language combinations. Please contribute!

Results and models

There are some initial baseline results for parts of the data set using the setup of OPUS-MT but running on Tatoeba MT challenge data (see also OPUS-MT-TatoebaChallenge). Note that we include results from previous releases and other common test sets as well.

Challenge subset results (v2021-08-07):

Challenge subset results (v2020-07-28):

We publish (reasonable) models to be re-used and deployed through OPUS-MT and link them from the model subdirectory in this GitHub repository. This includes multilingual models that cover several source and target languages to enable transfer learning across languages.

How to participate

Everyone interested is free to use the data for their own development. Naturally, we encourage contributions by the community and will develop a leader board for individual language pairs. The idea is also to make pre-trained models available in order to support re-use and replicability. Consider, for example, contributing to OPUS-MT or uploading models to the model hub at huggingface (like the translation models from Helsinki-NLP).

Certain rules apply:

  • Don't use any dev or test data for training (dev can be used for validation during training as an early stopping criterion).
  • Only use the provided training data when training models for comparison in the constrained setting. Any combination of language pairs is fine, and backtranslation of sentences included in the training data for any language pair is also allowed. This means that additional data sets, parallel or monolingual, are not allowed for official models that are to be compared with others. Unconstrained models may also be trained and can be reported as a separate category.
  • Using pre-trained language or translation models falls into the unconstrained category. Make sure that the pre-trained model does not include Tatoeba data that we reserve for testing! Note that current OPUS-MT models cannot be used as they contain Tatoeba data that may overlap with the test data in this release!
  • We encourage you to make your models available through OPUS-MT or other public means. This ensures replicability and re-use of pre-trained models! If you want to enter the official leader board, you must make your model available, including instructions on how to use it!

Don't hesitate to contact us in case of questions and suggestions. Thanks for your contributions and enjoy!

Note on language labels

The labels are converted from the original OPUS language IDs (which are mostly ISO-639-1) and information about the script is automatically assigned using Unicode regular expressions and counting letters from specific script character properties. Only the most frequently present script is shown. Be aware of mixed content and possible mistakes in the assignment. Note that the code Zyyy refers to common characters that cannot be used to distinguish scripts. The script code is not added if there is only one script in that language and no other scripts are detected in the string. If there is a default script among several alternatives then this script is not shown either. Note that the assignment is done fully automatically and no corrections have been made. This may go wrong for several reasons. For illustration, here is an example for Serbo-Croatian languages and Chinese from the Tatoeba test data:

bos_Latn        cmn_Hani        Želim da mi ti kažeš istinu.    我想你把真相告诉我。
hrv     cmn_Hani        Molim Vas odgovorite na moje pitanje.   请回答我的问题。
hrv     cmn_Hani        Hvala ti, ne bih to mogao bez tebe.     没有你我无法做到，谢谢。
hrv     cmn_Hani        Ti si moja majka.       你是我妈妈。
srp_Cyrl        cmn_Kana        То је моја мачка.       那是我的猫。
hrv     cmn_Yiii        Bok.    你好。
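The idea of picking the most frequent script in a string can be sketched in Python as follows. This is only an approximation under stated assumptions and not the procedure used to build the release: the Python standard library has no script-property regular expressions, so the sketch uses the first word of each character's Unicode name as a stand-in for its script. The mapping table and the function name dominant_script are hypothetical, and returning Zyyy for strings without any recognised letters is an assumption of this sketch.

```python
import unicodedata
from collections import Counter

# Assumption: the first word of a Unicode character name is used as a rough
# proxy for the script; only a handful of scripts are covered here.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latn",
    "CYRILLIC": "Cyrl",
    "CJK": "Hani",
    "HEBREW": "Hebr",
    "HIRAGANA": "Hira",
    "KATAKANA": "Kana",
}


def dominant_script(text):
    """Return the script code of the most frequent recognised script in text."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation and spaces do not identify a script
        prefix = unicodedata.name(ch, "").split(" ", 1)[0]
        script = NAME_PREFIX_TO_SCRIPT.get(prefix)
        if script is not None:
            counts[script] += 1
    if not counts:
        return "Zyyy"  # assumption: no script-specific letters found
    return counts.most_common(1)[0][0]
```

A counting approach like this labels a mixed string by its majority script, which is one way the mixed content mentioned above can end up with a misleading label.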

Note on test and development data

Test and development data are taken from a shuffled version of Tatoeba. All translation alternatives are included in the data set to obtain the best coverage of languages in the collection. Development and test sets are disjoint in the sense that they do not include identical source-target language sentence pairs. However, there can be identical source sentences or identical target sentences in both sets that are not linked to the same translations. Similarly, there can be identical source or target sentences with different translations within one of the sets, for example the test set. Below, you can see examples from the Esperanto-Ladino test set:

epo     lad     Kio estas vorto?        קי איס און ביירבﬞו?
epo     lad     Kio estas vorto?        קי איס אונה פאלאבﬞרה?
epo     lad_Latn        Ĉu vi estas en Berlino? Estash en Berlin?
epo     lad_Latn        Ĉu vi estas en Berlino? Vos estash en Berlin?
epo     lad_Latn        Ĉu vi estas en Berlino? Vozotras estash en Berlin?
epo     lad_Latn        La hundo estas nigra.   El perro es preto.
epo     lad_Latn        La hundo nigras.        El perro es preto.

The test data could have been organized as multi-reference data sets, but this would require providing different sets for the two translation directions. Removing alternative translations is also not a good option as this would take away a lot of relevant data. Hence, we decided to provide the data sets as they are, which implicitly creates multi-reference test sets but with the wrong normalization.
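For users who want explicit multi-reference sets for one translation direction, the alternatives can be grouped by source sentence. The following Python sketch assumes four tab-separated columns in the order of the examples above (source label, target label, source sentence, target sentence); the function name group_references is hypothetical. As noted, the opposite direction would need its own grouping by target sentence.

```python
from collections import defaultdict


def group_references(lines):
    """Map (source label, target label, source sentence) to its reference translations.

    Assumption: each line has exactly four tab-separated columns.
    """
    references = defaultdict(list)
    for line in lines:
        src_label, trg_label, src, trg = line.rstrip("\n").split("\t")
        references[(src_label, trg_label, src)].append(trg)
    return dict(references)
```

Language labels are part of the key, so translations into lad and lad_Latn are kept apart even when the source sentence is identical.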

License

These data are released under this licensing scheme:

CC-BY-NC-SA 4.0 license

Notice and take down policy

Notice: Should you consider that our data contains material that is owned by you and should therefore not be reproduced here, please:

  • Clearly identify yourself, with detailed contact data such as an address, telephone number or email address at which you can be contacted.
  • Clearly identify the copyrighted work claimed to be infringed.
  • Clearly identify the material that is claimed to be infringing and information reasonably sufficient to allow us to locate the material.
  • And contact Jörg Tiedemann at the following email address: jorg DOT tiedemann AT helsinki.fi.

Take down: We will comply with legitimate requests by removing the affected sources from the next release of the data set.
