TED-Parallel-Corpus
The TED Parallel Corpus is a growing collection of bilingual parallel corpora, multilingual parallel corpora, and monolingual corpora extracted from TED talks (www.ted.com) for 109 world languages. It includes a monolingual corpus, a bilingual parallel corpus covering 12 languages with over 120 million aligned sentences, and a multilingual parallel corpus covering 13 languages with more than 600k sentences. The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. All pre-processing was done automatically; no manual corrections have been carried out.
Author
Mr. Ajinkya Kulkarni, Contact: [email protected]
Multilingual Parallel Corpus :
12-language aligned parallel corpus data: contains sentences aligned in parallel across 12 languages: ar (Arabic), zh-cn (Chinese, Simplified), zh-tw (Chinese, Traditional), nl (Dutch), fr (French), de (German), he (Hebrew), it (Italian), ja (Japanese), ko (Korean), ru (Russian), es (Spanish).
Sentences : 349049
4-language aligned parallel corpus data: contains sentences aligned in parallel across 4 East Asian languages: zh-cn (Chinese, Simplified), zh-tw (Chinese, Traditional), ja (Japanese), ko (Korean).
Sentences : 389764
Bilingual Parallel Corpus :
Language 1 | Language 2 | Sentences | Language 1 | Language 2 | Sentences |
---|---|---|---|---|---|
Russian | Spanish | 523485 | Korean | French | 462616 |
Arabic | Hebrew | 512358 | Korean | Hebrew | 485919 |
Dutch | Russian | 442167 | Spanish | Hebrew | 486466 |
Arabic | Russian | 555618 | Dutch | Chinese, Traditional | 406528 |
Hebrew | Spanish | 486466 | Hebrew | German | 449485 |
Spanish | Chinese, Simplified | 479771 | Hebrew | French | 464923 |
Spanish | Russian | 523485 | Hebrew | Italian | 480730 |
Russian | Chinese, Simplified | 533541 | Russian | Italian | 523015 |
German | Chinese, Traditional | 438420 | Dutch | Spanish | 415347 |
Italian | Spanish | 477021 | Chinese, Simplified | Chinese, Traditional | 464982 |
Spanish | French | 463476 | Chinese, Simplified | German | 442415 |
Arabic | Dutch | 411929 | Korean | Spanish | 486162 |
Chinese, Traditional | Arabic | 473423 | Hebrew | Dutch | 415768 |
French | Italian | 458939 | German | Hebrew | 449485 |
Russian | Dutch | 442167 | Chinese, Traditional | Italian | 455363 |
Dutch | Italian | 407669 | Arabic | Italian | 486628 |
Russian | Arabic | 555618 | Arabic | Chinese, Traditional | 473423 |
Chinese, Traditional | Spanish | 465481 | Chinese, Traditional | Russian | 506240 |
German | Chinese, Simplified | 442415 | Spanish | Dutch | 415347 |
French | German | 442292 | Dutch | Hebrew | 415768 |
Chinese, Simplified | French | 458083 | Spanish | German | 452661 |
Arabic | Spanish | 491987 | Russian | Chinese, Traditional | 506240 |
Chinese, Simplified | Dutch | 406971 | Hebrew | Chinese, Traditional | 473169 |
German | Arabic | 445899 | Arabic | German | 445899 |
German | Dutch | 411134 | Chinese, Simplified | Italian | 473247 |
Italian | Chinese, Simplified | 473247 | Arabic | French | 469558 |
Chinese, Traditional | Dutch | 406528 | Hebrew | Russian | 541540 |
French | Hebrew | 464923 | Italian | Hebrew | 480730 |
Hebrew | Arabic | 512358 | French | Arabic | 469558 |
Chinese, Simplified | Hebrew | 496348 | Russian | Hebrew | 541540 |
Hebrew | Chinese, Simplified | 496348 | German | Russian | 479543 |
Chinese, Simplified | Arabic | 502194 | Spanish | Italian | 477021 |
French | Chinese, Traditional | 448751 | Dutch | Arabic | 411929 |
Italian | German | 444088 | Chinese, Traditional | German | 438420 |
Dutch | Chinese, Simplified | 406971 | Spanish | Arabic | 491987 |
Chinese, Traditional | Hebrew | 473169 | Russian | German | 479543 |
German | French | 442292 | Chinese, Traditional | French | 448751 |
Spanish | Chinese, Traditional | 465481 | Spanish | Korean | 486162 |
Dutch | German | 411134 | French | Dutch | 409715 |
Italian | Chinese, Traditional | 455363 | Italian | Dutch | 407669 |
French | Russian | 500195 | French | Spanish | 463476 |
German | Spanish | 452661 | Russian | French | 500195 |
Chinese, Traditional | Chinese, Simplified | 464982 | Italian | Russian | 523015 |
Arabic | Chinese, Simplified | 502194 | German | Italian | 444088 |
French | Chinese, Simplified | 458083 | Italian | French | 458939 |
Chinese, Simplified | Spanish | 479771 | Chinese, Simplified | Russian | 533541 |
Hebrew | Korean | 485919 | Dutch | French | 409715 |
French | Korean | 462616 | Italian | Arabic | 486628 |
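
Sentence-aligned corpora such as the pairs listed above are commonly consumed as two parallel streams of lines, where line *n* on one side is the translation of line *n* on the other. The following is a minimal sketch under that assumption. This README does not state the actual file layout, so the one-sentence-per-line format, the function name `read_aligned_pairs`, and the file names in the usage comment are all hypothetical.

```python
from itertools import zip_longest
from typing import Iterable, Iterator, Tuple


def read_aligned_pairs(
    src_lines: Iterable[str], tgt_lines: Iterable[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned streams.

    Assumes one sentence per line on each side (an assumption, not a
    documented property of this corpus). Raises ValueError if one stream
    ends before the other, since that would break the alignment.
    """
    for line_no, (src, tgt) in enumerate(zip_longest(src_lines, tgt_lines), start=1):
        if src is None or tgt is None:
            raise ValueError(f"streams differ in length at line {line_no}")
        src, tgt = src.strip(), tgt.strip()
        # Skip pairs where either side is empty after trimming whitespace.
        if src and tgt:
            yield src, tgt


# Usage with hypothetical file names:
# with open("ted.ru-es.ru", encoding="utf-8") as f_ru, \
#         open("ted.ru-es.es", encoding="utf-8") as f_es:
#     for ru_sentence, es_sentence in read_aligned_pairs(f_ru, f_es):
#         ...
```

Failing loudly on a length mismatch is a deliberate choice here: silently truncating to the shorter side could hide a shifted alignment, which is especially risky because the data has had no manual correction.
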
Monolingual Corpus :
Language | Sentences | Language | Sentences | Language | Sentences |
---|---|---|---|---|---|
Azerbaijani | 20852 | Swahili | 7204 | Assamese | 57 |
Chinese, Yue | 20940 | Czech | 272464 | Khmer | 614 |
Latgalian | 9 | Silesian | 91 | Norwegian Nynorsk | 3012 |
Chinese, Simplified | 507085 | Basque | 12303 | Occitan | 54 |
Algerian Arabic | 1716 | Macedonian | 69086 | Hupa | 3 |
Belarusian | 10965 | Montenegrin | 4181 | Danish | 128916 |
Macedo | 3068 | Finnish | 61604 | Igbo | 68 |
Croatian | 326967 | Hungarian | 398138 | Asturian | 232 |
Malayalam | 7218 | Punjabi | 48 | Serbian | 359791 |
Turkish | 433023 | Russian | 609744 | Irish | 256 |
Bulgarian | 475860 | Bislama | 49 | Kazakh | 6993 |
Tagalog | 2397 | Afrikaans | 2903 | Filipino | 7513 |
Nepali | 4350 | French | 493026 | Icelandic | 4957 |
Vietnamese | 349731 | German | 471902 | Mongolian | 19737 |
Albanian | 148541 | Esperanto | 18966 | French (Canada) | 68316 |
Slovak | 175052 | Georgian | 37013 | Telugu | 4104 |
Maltese | 343 | Latin | 46 | Serbo-Croatian | 5239 |
Swedish Chef | 375 | Cebuano | 203 | Tamil | 20805 |
Somali | 4545 | Uyghur | 1410 | Bosnian | 20522 |
Hindi | 48513 | Galician | 22368 | Slovenian | 63981 |
Tibetan | 2085 | Romanian | 454412 | Indonesian | 236543 |
Catalan | 89358 | Lao | 854 | Tatar | 277 |
Ingush | 377 | Ukrainian | 282163 | Kyrgyz | 1480 |
Tajik | 1147 | Kannada | 3716 | Hausa | 51 |
Arabic | 553483 | Gujarati | 5636 | Klingon | 131 |
Amharic | 1596 | Italian | 501685 | Dutch | 433318 |
Latvian | 60171 | Marathi | 22345 | Swedish | 121479 |
Estonian | 33236 | Lithuanian | 116956 | Sinhala | 1602 |
Creole, Haitian | 417 | Malagasy | 729 | Persian | 362411 |
Uzbek | 6201 | Bengali | 17107 | Hebrew | 535665 |
Pashto | 491 | Armenian | 69923 | | |
Spanish | 521162 | Luxembourgish | 217 | | |
Thai | 237086 | Portuguese, Brazilian | 476576 | | |
Burmese | 41266 | Urdu | 19861 | | |
Portuguese | 250967 | Chinese, Traditional | 483199 | | |
Norwegian Bokmal | 47441 | Malay | 23502 | | |
Conditions of use
The TED-Multilingual-Parallel-Corpus contains text from the publicly accessible source www.ted.com. All data have been processed automatically so that it is not possible to reconstruct the original source texts. The data are made available on the condition that they are used for scientific purposes only and are not passed on to third parties. Any use of the data must be duly documented and referenced.
Disclaimer
The TED-Multilingual-Parallel-Corpus has been processed automatically from publicly accessible sources at www.ted.com, based on the outlined methodology, without considering in detail the content of the contained text. No responsibility is taken for the content of the data. In particular, the views and opinions expressed in specific parts of the data remain exclusively with the authors. For each word, the list of words that significantly co-occur with it is computed on the basis of the available text and expresses neither a general fact of language nor the particular view of the author for Natural Language Processing. Please let us know if you find problems with the data or if you want the data for other language pairs.