• Stars
    star
    543
  • Rank 81,823 (Top 2 %)
  • Language
  • Created about 5 years ago
  • Updated 8 months ago

Repository Details

A collaborative catalog of NLP resources for Indic languages

🔖 The Indic NLP Catalog

A Collaborative Catalog of Resources for Indic Language NLP

The Indic NLP Catalog repository is an attempt to collaboratively build the most comprehensive catalog of NLP datasets, models and other resources for all languages of the Indian subcontinent.

Please suggest any other resources you may be aware of. Raise a pull request or an issue to add more resources to the catalog. Put the proposed entry in the following format:

[Wikipedia Dumps](https://dumps.wikimedia.org/)

Add a small, informative description of the dataset and provide links to any paper/article/site documenting the resource. Mention your name too. We would like to acknowledge your contribution to building this catalog in the CONTRIBUTORS list.
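
As a rough illustration of the entry format described above, the following Python sketch builds a catalog line from a resource name, a URL, and a short description. The function name `format_entry` and the `link: description` layout are assumptions made for illustration; the catalog itself only specifies the `[name](url)` link form.

```python
def format_entry(name: str, url: str, description: str = "") -> str:
    """Return a Markdown link, optionally followed by a short description.

    Hypothetical helper, not part of the catalog repository.
    """
    name = name.strip()
    url = url.strip()
    description = description.strip()
    if not name or not url:
        raise ValueError("both a resource name and a URL are required")
    link = f"[{name}]({url})"
    # Append the description only when one is supplied.
    return f"{link}: {description}" if description else link


if __name__ == "__main__":
    print(format_entry("Wikipedia Dumps", "https://dumps.wikimedia.org/"))
```

Called with only a name and a URL, the helper reproduces the example link shown above; a description, when given, is appended after a colon.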

Featured Resources

Indian language NLP has come a long way. We feature a few resources that are illustrative of the trends in recent times along various axes and point to a bright future.

  • Universal Language Contribution API (ULCA): ULCA is a standard API and open, scalable data platform (supporting various types of datasets) for Indian language datasets and models. ULCA is part of the Bhashini mission. You can upload and discover models, datasets and benchmarks here. This is a repository the community really needs, and we hope to see it evolve into a standard, large-scale platform for resource discovery and dissemination.
  • We are seeing the rise of large-scale datasets across many tasks, such as IndicCorp (text corpus/9 billion tokens), Samanantar (parallel corpus/50 million sentence pairs), Naamapadam (named entity/5.7 million sentences), HiNER (named entity/100k sentences), Aksharantar (transliteration/26 million pairs), etc. These are being built using large-scale mining of web resources, large human annotation efforts, or both.
  • As we aim higher, datasets and models are achieving higher language coverage. Earlier datasets were available for only a handful of Indian languages, then for 10-12 languages. We are now reaching the next frontier, creating resources like Aksharantar (transliteration/21 languages), FLORES-200 (translation/27 languages) and IndoWordNet (wordnet/18 languages) that span almost all languages listed in the Indian constitution and more.
  • Particularly, we are seeing datasets getting created for extremely low-resourced languages or languages not yet covered in any dataset like Bodo, Kangri, Khasi, etc.
  • From a handful of institutes who pioneered the development of NLP in India, we now have an increasing number of institutes/interest groups and passionate volunteers like AI4Bharat, BUET CSE NLP, KMI, L3Cube, iNLTK, IIT Patna, etc. who are contributing to building resources for Indian languages.

Browse the entire catalog...

🙋 Note: Many known resources have not yet been classified into the catalog. They can be found as open issues in the repo.

Major Indic Language NLP Repositories

Libraries and Tools

  • Indic NLP Library: Python library for various Indian language NLP tasks like tokenization, sentence splitting, normalization, script conversion, transliteration, etc.
  • pyiwn: Python Interface to IndoWordNet
  • Indic-OCR : OCR for Indic Scripts
  • CLTK: Toolkit for many of the world's classical languages. Support for Sanskrit. Some parts of the Sanskrit library are forked from the Indic NLP Library.
  • iNLTK: iNLTK aims to provide out of the box support for various NLP tasks that an application developer might need for Indic languages.
  • Sanskrit Coders Indic Transliteration: Script conversion and romanization for Indian languages.
  • Smart Sanskrit Annotator: Annotation tool for Sanskrit (paper)
  • BNLP: Bengali language processing toolkit with tokenization, embedding, POS tagging, NER support
  • CodeSwitch: Language identification, POS tagging, NER, sentiment analysis support for code-mixed data, including Hindi and Nepali

Evaluation Benchmarks

Benchmarks spanning multiple tasks.

  • AI4Bharat IndicGLUE: NLU benchmark for 11 languages.
  • AI4Bharat IndicNLG Suite: NLG benchmark for 11 languages spanning 5 generation tasks: biography generation, sentence summarization, headline generation, paraphrase generation and question generation.
  • GLUECoS: Hindi-English code-mixed benchmark containing the following tasks: Language Identification (LID), POS Tagging (POS), Named Entity Recognition (NER), Sentiment Analysis (SA), Question Answering (QA), Natural Language Inference (NLI).
  • AI4Bharat Text Classification: A compilation of classification datasets for 10 languages.
  • WAT 2021 Translation Dataset: Standard train and test sets for translation between English and 10 Indian languages.

Standards

Text Corpora

Monolingual Corpus

Language Identification

Lexical Resources and Semantic Similarity

NER Corpora

Parallel Translation Corpus

Parallel Transliteration Corpus

Text Classification

Textual Entailment/Natural Language Inference

Paraphrase

Sentiment, Sarcasm, Emotion Analysis

Hate Speech and Offensive Comments

Question Answering

  • Facebook Multilingual QA datasets: Contains dev and test sets for Hindi.
  • TyDi QA datasets: QA dataset for Bengali and Telugu.
  • bAbi 1.2 dataset: Has Hindi version of bAbi tasks in romanized Hindi.
  • MMQA dataset: Hindi QA dataset described in this paper
  • XQuAD: testset for Hindi QA from human translation of subset of SQuAD v1.1. Described in this paper
  • XQA: testset for Tamil QA. Described in this paper
  • HindiRC: A Dataset for Reading Comprehension in Hindi containing 127 questions and 24 passages. Described in this paper
  • IITH HiDG: A Distractor Generation Dataset for Hindi consisting of 1k/1k/5k (train/validation/test) split. Described in this paper
  • Chaii: A Kaggle challenge consisting of 1104 questions in Hindi and Tamil. A good collection of papers on multilingual Question Answering is available here.
  • csebuetnlp Bangla QA: A Question Answering (QA) dataset for Bengali. Described in this paper.
  • XOR QA: A large-scale cross-lingual open-retrieval QA dataset (includes Bengali and Telugu) with 40k newly annotated open-retrieval questions that cover seven typologically diverse languages. Described in this paper. More information is available here.
  • IITB HiQuAD: A question answering dataset in Hindi consisting of 6555 question-answer pairs. Described in this paper.

Dialog

Discourse

Information Extraction

  • EventXtract-IL: Event extraction for Tamil and Hindi. Described in this paper.
  • [EDNIL-FIRE2020](https://ednilfire.github.io/ednil/2020/index.html): Event extraction for Tamil, Hindi, Bengali, Marathi, English. Described in this paper.
  • Amazon MASSIVE: A Multilingual Amazon SLURP (SLU resource package) for Slot Filling, Intent Classification, and Virtual-Assistant Evaluation containing one million realistic, parallel, labeled virtual-assistant text utterances spanning 51 languages, 18 domains, 60 intents, and 55 slots. Described in this paper.
  • Facebook - MTOP Benchmark: A Comprehensive Multilingual Task-Oriented Semantic Parsing Benchmark with a dataset comprising 100k annotated utterances in 6 languages (including one Indic language, Hindi) across 11 domains. Described in this paper.

POS Tagged corpus

Chunk Corpus

Dependency Parse Corpus

Coreference Corpus

Summarization

  • XL-Sum: A Large-Scale Multilingual Abstractive Summarization for 44 Languages with a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC. Described in this paper.

Data to Text

  • XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages, comprising a high-quality XF2T dataset in 7 languages (Hindi, Marathi, Gujarati, Telugu, Tamil, Kannada, Bengali) and a monolingual dataset in English. The dataset is available upon request. Described in this paper.

Models

Language Identification

  • NLLB-200: LID for 200 languages including 27 Indic languages.

Word Embeddings

Pre-trained Language Models

  • AI4Bharat IndicBERT: Multilingual ALBERT-based embeddings spanning 12 languages for Natural Language Understanding (including Indian English).
  • AI4Bharat IndicBART: A multilingual, sequence-to-sequence pre-trained model based on the mBART architecture focusing on 11 Indic languages and English for Natural Language Generation of Indic Languages. Described in this paper.
  • MuRIL: Multilingual mBERT-based embeddings spanning 17 languages and their transliterated counterparts for Natural Language Understanding (paper).
  • BERT Multilingual: BERT model trained on Wikipedias of many languages (including major Indic languages).
  • mBART50: seq2seq pre-trained model trained on CommonCrawl of many languages (including major Indic languages).
  • BLOOM: GPT-3-like multilingual transformer-decoder language model (includes major Indic languages).
  • iNLTK: ULMFit and TransformerXL pre-trained embeddings for many languages trained on Wikipedia and some News articles.
  • albert-base-sanskrit: ALBERT-based model trained on Sanskrit Wikipedia.
  • RoBERTa-hindi-guj-san: Multilingual RoBERTa-like model trained on Hindi, Sanskrit and Gujarati.
  • Bangla-BERT-Base: Bengali BERT model trained on Bengali Wikipedia and OSCAR datasets.
  • BanglaBERT: Language Model Pretraining and Benchmarks for Low-Resource Language Understanding Evaluation in Bangla. Described in this paper.
  • EM-ALBERT: The first ALBERT model available for the Manipuri language, trained on 1,034,715 Manipuri sentences.
  • LaBSE: Encoder models suitable for sentence retrieval tasks supporting 109 languages (including all major Indic languages) [paper].
  • LASER3: Encoder models suitable for sentence retrieval tasks supporting 200 languages (including 27 Indic languages).

Multilingual Word Embeddings

Morphanalyzers

Translation Models

  • IndicTrans: Multilingual neural translation models for translation between English and 11 Indian languages. Supports translation between Indian languages as well. A total of 110 translation directions are supported.
  • Shata-Anuvaadak: SMT for 110 language pairs (all pairs between English and 10 Indian languages).
  • LTRC Vanee: Dependency based Statistical MT system from English to Hindi.
  • NLLB-200: Models for 200 languages including 27 Indic languages.
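
The pair count quoted for Shata-Anuvaadak above follows from simple counting: a set of n languages has n × (n − 1) ordered translation directions. The sketch below checks this for English plus 10 Indian languages (11 languages in total). The function name is hypothetical, and the language labels are placeholders, not the languages any of these systems actually cover.

```python
from itertools import permutations


def translation_directions(languages):
    """Return every ordered (source, target) pair of distinct languages."""
    return list(permutations(languages, 2))


# Placeholder labels: English plus 10 Indian languages (names are assumptions).
languages = ["en"] + [f"indic_{i}" for i in range(1, 11)]
directions = translation_directions(languages)

# n languages give n * (n - 1) directions; here 11 * 10.
assert len(directions) == len(languages) * (len(languages) - 1)
print(len(directions))  # 110
```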

Transliteration Models

  • AI4Bharat IndicXlit: A transformer-based multilingual transliteration model with 11M parameters for Roman to native script conversion and vice versa that supports 21 Indic languages. Described in this paper.

Speech Models

NER

Speech Corpora

OCR Corpora

Multimodal Corpora

Language Specific Catalogs

Pointers to language-specific NLP resource catalogs

More Repositories

1

Indic-BERT-v1

Indic-BERT-v1: BERT-based Multilingual Model for 11 Indic Languages and Indian-English. For latest Indic-BERT v2, check: https://github.com/AI4Bharat/IndicBERT
Python
273
star
2

IndicTrans2

Translation models for 22 scheduled languages of India
Python
223
star
3

indicnlp_corpus

Describes the IndicNLP corpus and associated datasets
Python
149
star
4

Indic-TTS

Text-to-Speech for languages of India
Jupyter Notebook
130
star
5

indicTrans

indicTranslate v1 - Machine Translation for 11 Indic languages. For latest v2, check: https://github.com/AI4Bharat/IndicTrans2
Jupyter Notebook
111
star
6

OpenHands

OpenHands : Making Sign Language Recognition Accessible. | **NOTE:** No longer actively maintained. If you are interested to own this and take it forward, please raise an issue
Python
97
star
7

Chitralekha

Chitralekha - A video transcreation platform for Indic languages, supporting transcription, translation and voice-over
95
star
8

IndicLLMSuite

A blueprint for creating Pretraining and Fine-Tuning datasets for Indic languages
Python
89
star
9

IndicWav2Vec

Pretraining, fine-tuning and evaluation scripts for Indic-Wav2Vec2
Jupyter Notebook
74
star
10

IndicXlit

Transliteration models for 21 Indic languages
Python
68
star
11

NPTEL2020-Indian-English-Speech-Dataset

NPTEL2020: Speech2Text dataset for Indian-English Accent
Python
68
star
12

IndicBERT

Pretraining, fine-tuning and evaluation scripts for IndicBERT-v2 and IndicXTREME
Python
65
star
13

IndicNLP-Transliteration

Codebase for Indic-Transliteration using Seq2Seq RNN. For latest repo with Transformer-based models, check: https://github.com/AI4Bharat/IndicXlit
Python
58
star
14

Shoonya

Shoonya - Platform to Annotate and label data at scale.
50
star
15

vistaar

Vistaar: Diverse Benchmarks and Training Sets for Indian Language ASR
Python
43
star
16

indic-bart

Pre-trained, multilingual sequence-to-sequence models for Indian languages
Python
43
star
17

Chitralekha-Backend

Transcribe your videos and translate them into Indic languages.
Python
27
star
18

Indic-Input-Tool-UI

Web Interface for Transliteration for Indic languages.
JavaScript
22
star
19

Shoonya-Backend

DRF-based API server for Shoonya platform
Python
20
star
20

Svarah

Svarah: Indian-English speech dataset collected across the country
Python
20
star
21

IndicVoices-R

A Massive Multilingual Multi-speaker Speech Corpus for Scaling Indian TTS
19
star
22

FBI

FBI: Finding Blindspots in LLM Evaluations with Interpretable Checklists
Python
18
star
23

Shoonya-Frontend

JavaScript
16
star
24

Dhruva-Platform

Dhruva is an open-source platform for serving language AI models at scale.
TypeScript
15
star
25

indic-asr-api-backend

Indic-Conformer models for ASR
Python
13
star
26

INCLUDE

Code for INCLUDE paper with pre-trained models
Python
13
star
27

DocSim

Synthetically generate random text document images with ground-truth
Python
11
star
28

Fonts-for-Indian-Scripts

Font style transfer for Devanāgarī script using GANs
Python
10
star
29

aacl23-mnmt-tutorial

Additional resources from our AACL tutorial
10
star
30

adapter-efficiency

Python
10
star
31

IndicLID

Language Identification for Indian languages
Python
9
star
32

setu

Setu is a comprehensive pipeline designed to clean, filter, and deduplicate diverse data sources including Web, PDF, and Speech data. Built on Apache Spark, Setu encompasses four key stages: document preparation, document cleaning and analysis, flagging and filtering, and deduplication.
HTML
9
star
33

speech-transcript-cleaning

Perform cleaning and normalization to standardize speech transcripts (train and test) across datasets.
Python
8
star
34

ezAnnotate

Annotation Platform for Machine Learning / Data Science, forked from DataTurks
JavaScript
7
star
35

Anudesh-Frontend

JavaScript
7
star
36

Chitralekha-Frontend

Frontend for Chitralekha platform
JavaScript
7
star
37

transactional-voice-ai

The code for transactional voice AI
Python
6
star
38

Indic-Glossary-Explorer

Glossary service for Indian languages
JavaScript
6
star
39

workshop-nlg-nlu-2022

Material for AI Workshop on Natural Language Understanding and Generation
6
star
40

indicnlp.ai4bharat.org

Archived old website for AI4Bhārat Indic-NLP
HTML
5
star
41

Chitralekha-Frontend-Lite

Lightweight version of Chitralekha
JavaScript
5
star
42

Indic-Glossaries

Collection of datasets for glossaries in Indian languages
4
star
43

sign-language.ai4bharat.org

Website for Indian Sign Language Recognition
4
star
44

INCLUDE-MS-Teams-Integration

An experimental Microsoft Teams integration of Sign Language models for word-level sign recognition
C#
4
star
45

Anudesh-Backend

Python
4
star
46

IndicMT-Eval

IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation Metrics for Indian Languages, ACL 2023
HTML
4
star
47

IndicVoices

Jupyter Notebook
4
star
48

indic-numtowords

A simple lightweight library for text normalization for Indian Languages
Python
4
star
49

IndicSUPERB

Python
3
star
50

transactional-voice-ai_serving

Deployment code for all the Transactional Voice AI modules.
C++
3
star
51

CTQScorer

Python
3
star
52

Indic-Swipe

IndicSwipe is a collection of datasets and neural model architectures for decoding swipe gesture inputs on touch-based Indic language keyboards across 7 languages.
Python
3
star
53

Indic-OCR

2
star
54

DMU-DataDaan

Codebase for NLTM DMU's Data Upload System
JavaScript
2
star
55

2022.ai4bharat.org

Old website of AI4Bhārat using TinaCMS
JavaScript
2
star
56

setu-translate

Python
2
star
57

models.ai4bharat.org

A one stop platform to try out all the models built by the AI4Bharat team.
JavaScript
2
star
58

Shoonya-Frontend-Old

Old version of Shoonya UI. Latest repo: https://github.com/AI4Bharat/Shoonya-Frontend
JavaScript
2
star
59

Varnam-Transliteration-UI

Transliteration Web Interface
JavaScript
1
star
60

ai4b-website

TypeScript
1
star
61

Dhruva-Evaluation-Suite

A tool to perform functional testing and performance testing of the Dhruva Platform
Python
1
star
62

indicnlp_suite

Natural Language Understanding resources for Indian languages
1
star
63

Input-Tools-By-AI4bharat

Enhance your typing experience in Chrome with AI4Bharat's Input Tools Chrome extension. This extension provides real-time transliteration suggestions for Indian languages, offering seamless integration into your typing workflow.
JavaScript
1
star
64

Lahaja

This repository holds the artifacts of 'LAHAJA: A Robust Multi-accent Benchmark for Evaluating Hindi ASR Systems'
1
star
65

Rasa

Expressive TTS Dataset for Assamese, Bengali, and Tamil.
Python
1
star
66

NeMo

Python
1
star
67

VocabAdaptation_LLM

Python
1
star