TED-Parallel-Corpus

TED Parallel Corpora is a growing collection of bilingual parallel corpora, multilingual parallel corpora, and monolingual corpora extracted from TED talks (www.ted.com) for 109 world languages. It includes monolingual corpora, bilingual parallel corpora for 12 languages with over 120 million aligned sentences, and multilingual parallel corpora for 13 languages with more than 600k sentences. The goal of the extraction and processing was to generate sentence-aligned text for statistical machine translation systems. All pre-processing was done automatically; no manual corrections have been carried out.
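As an illustration of what "sentence-aligned" means in practice, here is a minimal Python sketch for reading a bilingual pair. It assumes the common layout of two plain-text UTF-8 files with one sentence per line, where line N of one file is the translation of line N of the other. The file layout, the function name read_aligned, and the file names in the usage note are assumptions for illustration; this README does not specify the actual file format of the corpus.

```python
from itertools import zip_longest
from typing import Iterator, Tuple


def read_aligned(src_path: str, tgt_path: str) -> Iterator[Tuple[str, str]]:
    """Yield (source, target) sentence pairs from two line-aligned files.

    Assumption: one sentence per line, same line number = same sentence.
    """
    with open(src_path, encoding="utf-8") as src, \
            open(tgt_path, encoding="utf-8") as tgt:
        for line_no, (s, t) in enumerate(zip_longest(src, tgt), start=1):
            # zip_longest pads the shorter file with None, which means
            # the two files are not aligned line for line.
            if s is None or t is None:
                raise ValueError(f"files differ in length at line {line_no}")
            s, t = s.strip(), t.strip()
            # Skip pairs where either side is empty after stripping.
            if s and t:
                yield s, t
```

With hypothetical file names, usage would look like `for ru, es in read_aligned("ted.ru.txt", "ted.es.txt"): ...`. Raising on a length mismatch is a deliberate choice here: silently truncating to the shorter file would hide a misalignment that affects every later pair.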

Author

Mr. Ajinkya Kulkarni, Contact: [email protected]


Multilingual Parallel Corpus :


12-language aligned parallel corpus: contains parallel aligned sentences for 12 languages: ar Arabic; zh-cn Chinese, Simplified; zh-tw Chinese, Traditional; nl Dutch; fr French; de German; he Hebrew; it Italian; ja Japanese; ko Korean; ru Russian; es Spanish.

Sentences : 349049


4-language aligned parallel corpus: contains parallel aligned sentences for 4 East Asian languages: zh-cn Chinese, Simplified; zh-tw Chinese, Traditional; ja Japanese; ko Korean.

Sentences : 389764


Bilingual Parallel Corpus :

Language 1 | Language 2 | Sentences | Language 1 | Language 2 | Sentences
Russian | Spanish | 523485 | Korean | French | 462616
Arabic | Hebrew | 512358 | Korean | Hebrew | 485919
Dutch | Russian | 442167 | Spanish | Hebrew | 486466
Arabic | Russian | 555618 | Dutch | Chinese, Traditional | 406528
Hebrew | Spanish | 486466 | Hebrew | German | 449485
Spanish | Chinese, Simplified | 479771 | Hebrew | French | 464923
Spanish | Russian | 523485 | Hebrew | Italian | 480730
Russian | Chinese, Simplified | 533541 | Russian | Italian | 523015
German | Chinese, Traditional | 438420 | Dutch | Spanish | 415347
Italian | Spanish | 477021 | Chinese, Simplified | Chinese, Traditional | 464982
Spanish | French | 463476 | Chinese, Simplified | German | 442415
Arabic | Dutch | 411929 | Korean | Spanish | 486162
Chinese, Traditional | Arabic | 473423 | Hebrew | Dutch | 415768
French | Italian | 458939 | German | Hebrew | 449485
Russian | Dutch | 442167 | Chinese, Traditional | Italian | 455363
Dutch | Italian | 407669 | Arabic | Italian | 486628
Russian | Arabic | 555618 | Arabic | Chinese, Traditional | 473423
Chinese, Traditional | Spanish | 465481 | Chinese, Traditional | Russian | 506240
German | Chinese, Simplified | 442415 | Spanish | Dutch | 415347
French | German | 442292 | Dutch | Hebrew | 415768
Chinese, Simplified | French | 458083 | Spanish | German | 452661
Arabic | Spanish | 491987 | Russian | Chinese, Traditional | 506240
Chinese, Simplified | Dutch | 406971 | Hebrew | Chinese, Traditional | 473169
German | Arabic | 445899 | Arabic | German | 445899
German | Dutch | 411134 | Chinese, Simplified | Italian | 473247
Italian | Chinese, Simplified | 473247 | Arabic | French | 469558
Chinese, Traditional | Dutch | 406528 | Hebrew | Russian | 541540
French | Hebrew | 464923 | Italian | Hebrew | 480730
Hebrew | Arabic | 512358 | French | Arabic | 469558
Chinese, Simplified | Hebrew | 496348 | Russian | Hebrew | 541540
Hebrew | Chinese, Simplified | 496348 | German | Russian | 479543
Chinese, Simplified | Arabic | 502194 | Spanish | Italian | 477021
French | Chinese, Traditional | 448751 | Dutch | Arabic | 411929
Italian | German | 444088 | Chinese, Traditional | German | 438420
Dutch | Chinese, Simplified | 406971 | Spanish | Arabic | 491987
Chinese, Traditional | Hebrew | 473169 | Russian | German | 479543
German | French | 442292 | Chinese, Traditional | French | 448751
Spanish | Chinese, Traditional | 465481 | Spanish | Korean | 486162
Dutch | German | 411134 | French | Dutch | 409715
Italian | Chinese, Traditional | 455363 | Italian | Dutch | 407669
French | Russian | 500195 | French | Spanish | 463476
German | Spanish | 452661 | Russian | French | 500195
Chinese, Traditional | Chinese, Simplified | 464982 | Italian | Russian | 523015
Arabic | Chinese, Simplified | 502194 | German | Italian | 444088
French | Chinese, Simplified | 458083 | Italian | French | 458939
Chinese, Simplified | Spanish | 479771 | Chinese, Simplified | Russian | 533541
Hebrew | Korean | 485919 | Dutch | French | 409715
French | Korean | 462616 | Italian | Arabic | 486628
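
Several pairs in the table above appear in both directions with the same sentence count (for example Russian/Spanish and Spanish/Russian, both 523485). When totalling sentences or listing available pairs, it can help to collapse each direction into one unordered pair. The following Python sketch shows one way to do that; the function name unique_pairs and the tuple-based row format are assumptions for illustration and are not part of the corpus tooling.

```python
from typing import Dict, Iterable, Tuple


def unique_pairs(
    rows: Iterable[Tuple[str, str, int]]
) -> Dict[Tuple[str, str], int]:
    """Collapse (lang1, lang2, count) rows into unordered language pairs.

    The key is the two language names in sorted order, so A/B and B/A
    map to the same entry. Conflicting counts raise an error.
    """
    pairs: Dict[Tuple[str, str], int] = {}
    for lang1, lang2, count in rows:
        key = tuple(sorted((lang1, lang2)))
        if key in pairs and pairs[key] != count:
            raise ValueError(f"conflicting counts for {key}")
        pairs[key] = count
    return pairs


# Three rows copied from the table above.
rows = [
    ("Russian", "Spanish", 523485),
    ("Spanish", "Russian", 523485),
    ("Arabic", "Hebrew", 512358),
]
deduplicated = unique_pairs(rows)
```

Here `deduplicated` holds two entries, one per unordered pair, which avoids double counting when summing sentence totals.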

Monolingual Corpus :

Language | Sentences | Language | Sentences | Language | Sentences
Azerbaijan | 20852 | Swahili | 7204 | Assamese | 57
Chinese, Yue | 20940 | Czech | 272464 | Khmer | 614
Latgalian | 9 | Silesian | 91 | Norwegian Nynorsk | 3012
Chinese, Simplified | 507085 | Basque | 12303 | Occitan | 54
Algerian Arabic | 1716 | Macedonian | 69086 | Hupa | 3
Belarusian | 10965 | Montenegrin | 4181 | Danish | 128916
Macedo | 3068 | Finnish | 61604 | Igbo | 68
Croatian | 326967 | Hungarian | 398138 | Asturian | 232
Malayalam | 7218 | Punjabi | 48 | Serbian | 359791
Turkish | 433023 | Russian | 609744 | Irish | 256
Bulgarian | 475860 | Bislama | 49 | Kazakh | 6993
Tagalog | 2397 | Afrikaans | 2903 | Filipino | 7513
Nepali | 4350 | French | 493026 | Icelandic | 4957
Vietnamese | 349731 | German | 471902 | Mongolian | 19737
Albanian | 148541 | Esperanto | 18966 | French (Canada) | 68316
Slovak | 175052 | Georgian | 37013 | Telugu | 4104
Maltese | 343 | Latin | 46 | Serbo-Croatian | 5239
Swedish Chef | 375 | Cebuano | 203 | Tamil | 20805
Somali | 4545 | Uyghur | 1410 | Bosnian | 20522
Hindi | 48513 | Galician | 22368 | Slovenian | 63981
Tibetan | 2085 | Romanian | 454412 | Indonesian | 236543
Catalan | 89358 | Lao | 854 | Tatar | 277
Ingush | 377 | Ukrainian | 282163 | Kyrgyz | 1480
Tajik | 1147 | Kannada | 3716 | Hausa | 51
Arabic | 553483 | Gujarati | 5636 | Klingon | 131
Amharic | 1596 | Italian | 501685 | Dutch | 433318
Latvian | 60171 | Marathi | 22345 | Swedish | 121479
Estonian | 33236 | Lithuanian | 116956 | Sinhala | 1602
Creole, Haitian | 417 | Malagasy | 729 | Persian | 362411
Uzbek | 6201 | Bengali | 17107 | Hebrew | 535665
Pashto | 491 | Armenian | 69923 | |
Spanish | 521162 | Luxembourgish | 217 | |
Thai | 237086 | Portuguese, Brazilian | 476576 | |
Burmese | 41266 | Urdu | 19861 | |
Portuguese | 250967 | Chinese, Traditional | 483199 | |
Norwegian Bokmal | 47441 | Malay | 23502 | |


Conditions of use

The TED-Multilingual-Parallel-Corpus contains text from the publicly accessible source www.ted.com. All data have been processed automatically so that it is not possible to reconstruct the original source texts. The data are made available on the condition that they are used for scientific purposes only and are not passed on to third parties. Any use of the data must be duly documented and referenced.


Disclaimer

The TED-Multilingual-Parallel-Corpus has been processed automatically from publicly accessible sources at www.ted.com, based on the outlined methodology, without considering in detail the content of the contained text. No responsibility is taken for the content of the data. In particular, the views and opinions expressed in specific parts of the data remain exclusively with the authors. For each word, the list of words that significantly co-occur with that word is computed on the basis of the available text and expresses neither a general fact of language nor the particular view of the author for Natural Language Processing. Please let us know if you find problems with the data or if you want data for other language pairs.

