  • Language
    Python
  • License
    MIT License
  • Created almost 2 years ago
  • Updated 4 months ago

Translation models for 22 scheduled languages of India

IndicTrans2

📜 Paper | 🌐 Website | ▶️ Demo | 🤗 HF Inference

IndicTrans2 is the first open-source transformer-based multilingual NMT model that supports high-quality translations across all the 22 scheduled Indic languages, including multiple scripts for low-resource languages like Kashmiri, Manipuri and Sindhi. It adopts script unification wherever feasible to leverage transfer learning by lexical sharing between languages. Overall, the model supports five scripts: Perso-Arabic (Kashmiri, Sindhi, Urdu), Ol Chiki (Santali), Meitei (Manipuri), Latin (English), and Devanagari (used for all the remaining languages).

We open-source all our training data (BPCC), back-translation data (BPCC-BT), final IndicTrans2 models, evaluation benchmarks (IN22, which includes IN22-Gen and IN22-Conv) and training and inference scripts for easier use and adoption within the research community. We hope that this will foster even more research in low-resource Indic languages, leading to further improvements in the quality of low-resource translation through contributions from the research community.

This code repository contains instructions for downloading the artifacts associated with IndicTrans2, as well as the code for training/fine-tuning the multilingual NMT models.

Here is the list of languages supported by the IndicTrans2 models:

Assamese (asm_Beng)   Kashmiri (Arabic) (kas_Arab)       Punjabi (pan_Guru)
Bengali (ben_Beng)    Kashmiri (Devanagari) (kas_Deva)   Sanskrit (san_Deva)
Bodo (brx_Deva)       Maithili (mai_Deva)                Santali (sat_Olck)
Dogri (doi_Deva)      Malayalam (mal_Mlym)               Sindhi (Arabic) (snd_Arab)
English (eng_Latn)    Marathi (mar_Deva)                 Sindhi (Devanagari) (snd_Deva)
Konkani (gom_Deva)    Manipuri (Bengali) (mni_Beng)      Tamil (tam_Taml)
Gujarati (guj_Gujr)   Manipuri (Meitei) (mni_Mtei)       Telugu (tel_Telu)
Hindi (hin_Deva)      Nepali (npi_Deva)                  Urdu (urd_Arab)
Kannada (kan_Knda)    Odia (ory_Orya)
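
Each code in the list pairs a three-letter language identifier with a four-letter script tag, so the script can be read off the code itself. Here is a minimal sketch that groups a few of the codes by script. The helper name group_by_script and the sample list are illustrative and are not part of this repository.

```python
from collections import defaultdict

# A few codes copied from the list above; extend with the remaining ones as needed.
SAMPLE_CODES = ["asm_Beng", "ben_Beng", "kas_Arab", "kas_Deva", "mni_Mtei", "sat_Olck", "eng_Latn"]


def group_by_script(codes):
    """Group codes of the form <lang>_<Script> by their script tag."""
    groups = defaultdict(list)
    for code in codes:
        lang, script = code.split("_", 1)
        groups[script].append(lang)
    return dict(groups)


if __name__ == "__main__":
    for script, langs in sorted(group_by_script(SAMPLE_CODES).items()):
        print(script, langs)
```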

Updates

  • 🚨 Sep 9, 2023 - Added HF compatible IndicTrans2 models. Please refer to the README for detailed example usage.
  • 🚨 Dec 1, 2023 - Release of Indic-Indic model and corresponding distilled variants for each base model. Please refer to the Download section for the checkpoints.

Download Models and Other Artifacts

Multilingual Translation Models

Model                          En-Indic   Indic-En   Indic-Indic   Evaluations
Base (used for benchmarking)   download   download   download      translations (as of May 10, 2023), metrics
Distilled                      download   download   download

Training Data

Data URL
Bharat Parallel Corpus Collection (BPCC) download
Back-translation (BPCC-BT) download

Evaluation Data

Data URL
IN22 test set download
FLORES-22 Indic dev set download

Installation

Instructions to set up and install everything before running the code.

# Clone the github repository and navigate to the project directory.
git clone https://github.com/AI4Bharat/IndicTrans2
cd IndicTrans2

# Install all the dependencies and requirements associated with the project.
source install.sh

Note: We recommend creating a virtual environment with python>=3.7.

Data

Training

Bharat Parallel Corpus Collection (BPCC) is a comprehensive and publicly available parallel corpus that includes both existing and new data for all 22 scheduled Indic languages. It comprises two parts, BPCC-Mined and BPCC-Human, totaling approximately 230 million bitext pairs. BPCC-Mined contains about 228 million pairs, with nearly 126 million pairs newly added as a part of this work. On the other hand, BPCC-Human consists of 2.2 million gold standard English-Indic pairs, with an additional 644K bitext pairs from English Wikipedia sentences (forming the BPCC-H-Wiki subset) and 139K sentences covering everyday use cases (forming the BPCC-H-Daily subset). It is worth highlighting that BPCC provides the first available datasets for 7 languages and significantly increases the available data for all languages covered.

You can find the contribution from different sources in the following table:

BPCC-Mined   Existing      Samanantar     19.4M
                           NLLB           85M
             Newly Added   Samanantar++   121.6M
                           Comparable     4.3M
BPCC-Human   Existing      NLLB           18.5K
                           ILCI           1.3M
                           MASSIVE        115K
             Newly Added   Wiki           644K
                           Daily          139K
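
As a quick sanity check, the rounded counts in the table can be summed per part. The sketch below only transcribes the table; the dictionary layout and the function name total are illustrative. With these rounded figures the newly added mined rows sum to about 125.9M and the BPCC-Human rows to about 2.2M, in line with the description above, while all four mined rows sum to about 230.3M, slightly more than the approximately 228M quoted for BPCC-Mined. The source does not explain that difference.

```python
# Counts transcribed from the table above, in sentence pairs (rounded as published).
BPCC_MINED = {
    "existing": {"Samanantar": 19_400_000, "NLLB": 85_000_000},
    "newly_added": {"Samanantar++": 121_600_000, "Comparable": 4_300_000},
}
BPCC_HUMAN = {
    "existing": {"NLLB": 18_500, "ILCI": 1_300_000, "MASSIVE": 115_000},
    "newly_added": {"Wiki": 644_000, "Daily": 139_000},
}


def total(part):
    """Sum every source count in one part of the corpus."""
    return sum(sum(sources.values()) for sources in part.values())


if __name__ == "__main__":
    print("newly added mined:", sum(BPCC_MINED["newly_added"].values()))
    print("mined total:", total(BPCC_MINED))
    print("human total:", total(BPCC_HUMAN))
```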

Additionally, we provide augmented back-translation data generated by our intermediate IndicTrans2 models for training purposes. Please refer to our paper for more details on the selection of sample proportions and sources.

English BT data (English Original) 401.9M
Indic BT data (Indic Original) 400.9M

Evaluation

The IN22 test set is a newly created comprehensive benchmark for evaluating machine translation performance in multi-domain, n-way parallel contexts across 22 Indic languages. It has been created from three distinct subsets, namely IN22-Wiki, IN22-Web and IN22-Conv. The Wikipedia and Web sources subsets offer diverse content spanning news, entertainment, culture, legal, and India-centric topics. IN22-Wiki and IN22-Web have been combined for evaluation purposes and released as IN22-Gen. Meanwhile, IN22-Conv, the conversation-domain subset, is designed to assess translation quality in typical day-to-day conversational-style applications.

IN22-Gen (IN22-Wiki + IN22-Web) 1024 sentences 🤗 ai4bharat/IN22-Gen
IN22-Conv 1503 sentences 🤗 ai4bharat/IN22-Conv

You can download the data artifacts released as a part of this work from the following section.

Preparing Data for Training

BPCC data is organized under different subsets as described above, where each subset contains language pair subdirectories with the sentence pairs. We also provide LaBSE and LASER for the mined subsets of BPCC. In order to replicate our training setup, you will need to combine the data for corresponding language pairs from different subsets and remove overlapping bitext pairs, if any.

Here is the expected directory structure of the data:

BPCC
├── eng_Latn-asm_Beng
│   ├── train.eng_Latn
│   └── train.asm_Beng
├── eng_Latn-ben_Beng
└── ...

While we provide deduplicated subsets with the current available benchmarks, we highly recommend performing deduplication using the combined monolingual side of all the benchmarks. You can use the following command for deduplication once you combine the monolingual side of all the benchmarks in the directory.

python3 scripts/dedup_benchmark.py <in_data_dir> <out_data_dir> <benchmark_dir>
  • <in_data_dir>: path to the directory containing train data for each language pair in the format {src_lang}-{tgt_lang}
  • <out_data_dir>: path to the directory where the deduplicated train data will be written for each language pair in the format {src_lang}-{tgt_lang}
  • <benchmark_dir>: path to the directory containing the language-wise monolingual side of dev/test set, with monolingual files named as test.{lang}
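
To make the idea behind benchmark deduplication concrete, here is a minimal sketch that drops every training pair whose source or target side also occurs in the benchmarks. It is not the logic of scripts/dedup_benchmark.py, whose matching rules are not described here. The function names and the normalization (lowercasing and collapsing whitespace) are assumptions made for illustration.

```python
def normalize(sentence):
    """Assumed normalization: lowercase and collapse runs of whitespace."""
    return " ".join(sentence.lower().split())


def drop_benchmark_overlap(pairs, benchmark_sentences):
    """Keep only the (src, tgt) pairs where neither side appears in the benchmarks."""
    blocked = {normalize(s) for s in benchmark_sentences}
    return [
        (src, tgt)
        for src, tgt in pairs
        if normalize(src) not in blocked and normalize(tgt) not in blocked
    ]


if __name__ == "__main__":
    train = [("Hello  world", "tgt one"), ("Fresh sentence", "tgt two")]
    print(drop_benchmark_overlap(train, ["hello world"]))
```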

Using our SPM model and Fairseq dictionary

Once you complete the deduplication of the training data with the available benchmarks, you can preprocess and binarize the data for training models. Please download our trained SPM model and learned Fairseq dictionary using the following links for your experiments.

En-Indic Indic-En
SPM model download download
Fairseq dictionary download download

To prepare the data for training En-Indic model, please do the following:

  1. Download the SPM model in the experiment directory and rename it as vocab.
  2. Download the Fairseq dictionary in the experiment directory and rename it as final_dict.

Here is the expected directory for training En-Indic model:

en-indic-exp
├── train
│   ├── eng_Latn-asm_Beng
│   │   ├── train.eng_Latn
│   │   └── train.asm_Beng
│   ├── eng_Latn-ben_Beng
│   └── ...
├── devtest
│   └── all
│       ├── eng_Latn-asm_Beng
│       │   ├── dev.eng_Latn
│       │   └── dev.asm_Beng
│       ├── eng_Latn-ben_Beng
│       └── ...
├── vocab
│   ├── model.SRC
│   ├── model.TGT
│   ├── vocab.SRC
│   └── vocab.TGT
└── final_dict
    ├── dict.SRC.txt
    └── dict.TGT.txt

To prepare data for training the Indic-En model, you should reverse the language pair directories within the train and devtest directories. Additionally, make sure to download the corresponding SPM model and Fairseq dictionary and put them in the experiment directory, similar to the procedure mentioned above for En-Indic model training.
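
One way to reverse the language pair directories is sketched below. It assumes that reversing means renaming each {src_lang}-{tgt_lang} directory to {tgt_lang}-{src_lang} while leaving the files inside untouched; the function name reverse_pair_dirs is hypothetical. The sketch copies rather than renames, so the En-Indic layout stays intact.

```python
import shutil
from pathlib import Path


def reverse_pair_dirs(in_dir, out_dir):
    """Copy every <src>-<tgt> directory in in_dir to <tgt>-<src> in out_dir."""
    out_root = Path(out_dir)
    out_root.mkdir(parents=True, exist_ok=True)
    created = []
    for pair_dir in sorted(Path(in_dir).iterdir()):
        if not pair_dir.is_dir() or pair_dir.name.count("-") != 1:
            continue  # skip anything that is not a language-pair directory
        src, tgt = pair_dir.name.split("-")
        target = out_root / f"{tgt}-{src}"
        shutil.copytree(pair_dir, target)  # raises if the target already exists
        created.append(target.name)
    return created
```

You would call it once on the train directory and once on devtest/all, since the pair directories sit one level deeper under devtest.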

You can binarize the data for model training using the following:

bash prepare_data_joint_finetuning.sh <exp_dir>
  • <exp_dir>: path to the directory containing the raw data for binarization

You will need to follow the same steps for data preparation in case of fine-tuning models.

Training your own SPM models and learning Fairseq dictionary

If you want to train your own SPM model and learn Fairseq dictionary, then please do the following:

  1. Collect a balanced amount of English and Indic monolingual data (we use around 3 million sentences per language-script combination). If some languages have limited data available, increase their representation to achieve a fair distribution of tokens across languages.
  2. Perform script unification for Indic languages wherever possible using scripts/preprocess_translate.py and concatenate all Indic data into a single file.
  3. Train two SPM models, one for the English side and the other for the Indic side, using the following:
spm_train --input=train.indic --model_prefix=<model_name> --vocab_size=<vocab_size> --character_coverage=1.0 --model_type=BPE
  4. Copy the trained SPM models to the experiment directory mentioned earlier and learn the Fairseq dictionary using the following:
bash prepare_data_joint_training.sh <exp_dir>
  5. You will need to use the same Fairseq dictionary for any subsequent fine-tuning experiments; refer to the data preparation steps described above.
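
Step 1 asks for a larger share for languages with limited data but does not say how to achieve it. One possible reading is plain upsampling by repetition, sketched below. The function names and the target size are illustrative assumptions, and this is not necessarily the procedure used for the released SPM models.

```python
import itertools


def upsample(sentences, target_size):
    """Repeat a language's sentences in order until target_size lines are reached."""
    if not sentences:
        raise ValueError("cannot upsample an empty list")
    if len(sentences) >= target_size:
        return list(sentences)  # already large enough, leave unchanged
    return list(itertools.islice(itertools.cycle(sentences), target_size))


def balance(corpora, target_size):
    """Upsample every language below target_size; larger corpora are left unchanged."""
    return {lang: upsample(sents, target_size) for lang, sents in corpora.items()}


if __name__ == "__main__":
    toy = {"doi_Deva": ["s1", "s2"], "hin_Deva": ["h1", "h2", "h3", "h4", "h5"]}
    print({lang: len(sents) for lang, sents in balance(toy, 5).items()})
```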

Training / Fine-tuning

After binarizing the data, you can use train.sh to train the models. We provide the default hyperparameters used in this work. You can modify the hyperparameters as per your requirement if needed. If you want to train the model on a customized architecture, then please define the architecture in model_configs/custom_transformer.py. You can start the model training with the following command:

bash train.sh <exp_dir> <model_arch>
  • <exp_dir>: path to the directory containing the binarized data
  • <model_arch>: custom transformer architecture used for model training

For fine-tuning, the initial steps remain the same. However, the finetune.sh script includes an additional argument, pretrained_ckpt, which specifies the model checkpoint to be loaded for further fine-tuning. You can perform fine-tuning using the following command:

bash finetune.sh <exp_dir> <model_arch> <pretrained_ckpt>
  • <exp_dir>: path to the directory containing the binarized data
  • <model_arch>: custom transformer architecture used for model training
  • <pretrained_ckpt>: path to the fairseq model checkpoint to be loaded for further fine-tuning

You can download the model artifacts released as a part of this work from the following section.

The pretrained checkpoints have 3 directories: a fairseq model directory and 2 CT2-ported model directories. Please note that the CT2 models are provided only for efficient inference. For fine-tuning purposes, you should use the fairseq_model. Afterwards, you can use the fairseq-ct2-converter to port your fine-tuned checkpoints to CT2 for faster inference.

Inference

Fairseq Inference

In order to run inference on our pretrained models using the bash interface, please use the following:

bash joint_translate.sh <infname> <outfname> <src_lang> <tgt_lang> <ckpt_dir>
  • infname: path to the input file containing sentences
  • outfname: path to the output file where the translations should be stored
  • src_lang: source language
  • tgt_lang: target language
  • ckpt_dir: path to the fairseq model checkpoint directory

If you want to run inference using the Python interface, please execute the following block of code from the root directory:

from inference.engine import Model

model = Model(ckpt_dir, model_type="fairseq")

sents = [sent1, sent2,...]

# for a batch of sentences
model.batch_translate(sents, src_lang, tgt_lang)

# for a paragraph
model.translate_paragraph(text, src_lang, tgt_lang)

CT2 Inference

In order to run inference on a CT2-ported model using the Python interface, please execute the following block of code from the root directory:

from inference.engine import Model

model = Model(ckpt_dir, model_type="ctranslate2")

sents = [sent1, sent2,...]

# for a batch of sentences
model.batch_translate(sents, src_lang, tgt_lang)

# for a paragraph
model.translate_paragraph(text, src_lang, tgt_lang)
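
Both backends expose the same batch_translate call, so a long list of sentences can be fed to either one in smaller slices. The helper below is a sketch and not part of inference.engine: the name translate_in_chunks and the default chunk size are hypothetical, and it assumes that batch_translate returns one translation per input sentence, in order.

```python
def translate_in_chunks(model, sents, src_lang, tgt_lang, chunk_size=32):
    """Call model.batch_translate on slices of at most chunk_size sentences."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    outputs = []
    for start in range(0, len(sents), chunk_size):
        chunk = sents[start:start + chunk_size]
        # Assumed: one output per input sentence, returned in input order.
        outputs.extend(model.batch_translate(chunk, src_lang, tgt_lang))
    return outputs
```

With a model loaded as shown above, a call would look like translate_in_chunks(model, sents, "eng_Latn", "hin_Deva").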

Evaluations

We consider chrF++ as our primary metric. Additionally, we report the BLEU and COMET scores. We also perform statistical significance tests for each metric to ascertain whether the differences are statistically significant.

In order to run our evaluation scripts, you will need to organize the evaluation test sets into the following directory structure:

eval_benchmarks
├── flores
│   └── eng_Latn-asm_Beng
│       ├── test.eng_Latn
│       └── test.asm_Beng
├── in22-gen
├── in22-conv
├── ntrex
└── ...

To compute the BLEU and chrF++ scores for a prediction file, you can use the following command:

bash compute_metrics.sh <pred_fname> <ref_fname> <tgt_lang>
  • pred_fname: path to the model translations
  • ref_fname: path to the reference translations
  • tgt_lang: target language

In order to automate the inference over the individual test sets for En-Indic, you can use the following command:

bash eval.sh <devtest_data_dir> <ckpt_dir> <system>
  • <devtest_data_dir>: path to the evaluation set with language pair subdirectories (for example, flores directory in the above tree structure)
  • <ckpt_dir>: path to the fairseq model checkpoint directory
  • <system>: system name suffix to store the predictions in the format test.{lang}.pred.{system}

In case of Indic-En evaluation, please use the following command:

bash eval_rev.sh <devtest_data_dir> <ckpt_dir> <system>
  • <devtest_data_dir>: path to the evaluation set with language pair subdirectories (for example, flores directory in the above tree structure)
  • <ckpt_dir>: path to the fairseq model checkpoint directory
  • <system>: system name suffix to store the predictions in the format test.{lang}.pred.{system}

Note: You don't need to reverse the test set directions for each language pair.
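
The prediction naming format test.{lang}.pred.{system} can be expressed as a small path helper, which is handy when collecting outputs from several systems. The helper name prediction_path is hypothetical, and the sketch assumes that predictions are written next to the test files inside each language pair directory, which the text above does not state explicitly.

```python
from pathlib import Path


def prediction_path(devtest_data_dir, src_lang, tgt_lang, pred_lang, system):
    """Build <devtest_data_dir>/<src>-<tgt>/test.<pred_lang>.pred.<system>."""
    pair_dir = Path(devtest_data_dir) / f"{src_lang}-{tgt_lang}"
    return pair_dir / f"test.{pred_lang}.pred.{system}"


if __name__ == "__main__":
    # En-Indic: the predicted side is the Indic language.
    print(prediction_path("flores", "eng_Latn", "asm_Beng", "asm_Beng", "my_system"))
    # Indic-En: same directory name, but the predicted side is English.
    print(prediction_path("flores", "eng_Latn", "asm_Beng", "eng_Latn", "my_system"))
```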

In case of Indic-Indic evaluation, please use the following command:

bash pivot_eval.sh <devtest_data_dir> <pivot_lang> <src2pivot_ckpt_dir> <pivot2tgt_ckpt_dir> <system>
  • <devtest_data_dir>: path to the evaluation set with language pair subdirectories (for example, flores directory in the above tree structure)
  • <pivot_lang>: pivot language (default should be eng_Latn)
  • <src2pivot_ckpt_dir>: path to the fairseq Indic-En model checkpoint directory
  • <pivot2tgt_ckpt_dir>: path to the fairseq En-Indic model checkpoint directory
  • <system>: system name suffix to store the predictions in the format test.{lang}.pred.{system}
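
Pivoting chains two translation steps: source to pivot with the Indic-En model, then pivot to target with the En-Indic model. Here is a minimal sketch in terms of the batch_translate call from the Inference section. The function name pivot_translate is hypothetical, and this is not the code inside pivot_eval.sh.

```python
def pivot_translate(src2pivot, pivot2tgt, sents, src_lang, tgt_lang, pivot_lang="eng_Latn"):
    """Translate src -> pivot with one model, then pivot -> tgt with the other."""
    pivot_sents = src2pivot.batch_translate(sents, src_lang, pivot_lang)
    return pivot2tgt.batch_translate(pivot_sents, pivot_lang, tgt_lang)
```

Here src2pivot and pivot2tgt stand for two Model objects loaded from the Indic-En and En-Indic checkpoint directories respectively.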

In order to perform significance testing for BLEU and chrF++ metrics after you have the predictions for different systems, you can use the following command:

bash compute_comet_metrics_significance.sh <devtest_data_dir>
  • <devtest_data_dir>: path to the evaluation set with language pair subdirectories (for example, flores directory in the above tree structure)

Similarly, to compute the COMET scores and perform significance testing on predictions of different systems, you can use the following command.

bash compute_comet_score.sh <devtest_data_dir>
  • <devtest_data_dir>: path to the evaluation set with language pair subdirectories (for example, flores directory in the above tree structure)

Please note that as we compute significance tests with the same script and automate everything, it is best to have all the predictions for all the systems in place to avoid repeating anything. Also, the systems are defined in the script itself, so if you want to try out other systems, make sure to edit them there.
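
The text above does not say which significance test the scripts run. One common choice for comparing two MT systems is paired bootstrap resampling, sketched below purely as an illustration. For brevity the sketch compares sums of per-sentence scores, whereas corpus-level metrics such as BLEU and chrF++ would be recomputed on every resample. The function name and defaults are hypothetical.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of resamples in which system A's total score beats system B's.

    scores_a, scores_b: per-sentence scores for the same test sentences, in the same order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        # Draw n sentence indices with replacement and score both systems on them.
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / n_resamples
```

A win fraction close to 1.0 suggests that system A is reliably better on this test set, while a value near 0.5 suggests that the difference could be due to the particular sample of sentences.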

Baseline Evaluation

To generate the translation results for baseline models such as M2M-100, MBART, Azure, Google, and NLLB MoE, you can check the scripts provided in the "baseline_eval" directory of this repository. For NLLB distilled, you can either modify NLLB_MoE eval or use this repository. Similarly, for IndicTrans inference, please refer to this repository.

You can download the translation outputs released as a part of this work from the following section.

LICENSE

The following table lists the licenses associated with the different artifacts released as a part of this work:

Artifact LICENSE
Existing Mined Corpora (NLLB & Samanantar) CC0
Existing Seed Corpora (NLLB-Seed, ILCI, MASSIVE) CC0
Newly Added Mined Corpora (Samanantar++ & Comparable) CC0
Newly Added Seed Corpora (BPCC-H-Wiki & BPCC-H-Daily) CC-BY-4.0
Newly Created IN22 test set (IN22-Gen & IN22-Conv) CC-BY-4.0
Back-translation data (BPCC-BT) CC0
Model checkpoints MIT

The mined corpora collection (BPCC-Mined), existing seed corpora (NLLB-Seed, ILCI, MASSIVE), and back-translation data (BPCC-BT) are released under the following licensing scheme:

  • We do not own any of the text from which this data has been extracted.
  • We license the actual packaging of this data under the Creative Commons CC0 license ("no rights reserved").
  • To the extent possible under law, AI4Bharat has waived all copyright and related or neighboring rights to BPCC-Mined, existing seed corpora (NLLB-Seed, ILCI, MASSIVE) and BPCC-BT.

Citation

@article{gala2023indictrans2,
  title   = {IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages},
  author  = {Jay Gala and Pranjal A. Chitale and Raghavan AK and Varun Gumma and Sumanth Doddapaneni and Aswanth Kumar and Janki Nawale and Anupama Sujatha and Ratish Puduppully and Vivek Raghavan and Pratyush Kumar and Mitesh M. Khapra and Raj Dabre and Anoop Kunchukuttan},
  year    = {2023},
  journal = {Transactions on Machine Learning Research},
  url     = {https://openreview.net/forum?id=vfT4YuzAYA}
}
