Description Describes the IndicNLP corpus and associated datasets

AI4Bharat-IndicNLP Dataset

The AI4Bharat-IndicNLP dataset is an ongoing effort to create a collection of large-scale, general-domain corpora for Indian languages. Currently, it contains 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We evaluate the IndicNLP embeddings on multiple evaluation tasks.

You can read details regarding the corpus and other resources HERE. We showcased the AI4Bharat-IndicNLP dataset at RepL4NLP 2020 (co-located with ACL 2020) as a non-archival extended abstract. You can see the talk here: VIDEO.

You can use the IndicNLP corpus and embeddings for multiple Indian language tasks. A comprehensive list of Indian language NLP resources can be found in the IndicNLP Catalog. To process Indian language text, you can use the Indic NLP Library.

Text Corpora

The text corpora cover 12 languages.

Language # News Articles* Sentences Tokens Link
as 0.60M 1.39M 32.6M link
bn 3.83M 39.9M 836M link
en 3.49M 54.3M 1.22B link
gu 2.63M 41.1M 719M link
hi 4.95M 63.1M 1.86B link
kn 3.76M 53.3M 713M link
ml 4.75M 50.2M 721M link
mr 2.31M 34.0M 551M link
or 0.69M 6.94M 107M link
pa 2.64M 29.2M 773M link
ta 4.41M 31.5M 582M link
te 3.98M 47.9M 674M link

Note

  • The vocabulary frequency files contain the frequency of every unique token in the corpus. Each line contains one word and its frequency, separated by a tab.
  • For convenience, the corpus is already tokenized using the IndicNLP tokenizer. You can use the IndicNLP detokenizer in case you want a detokenized version.
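The vocabulary frequency format described in the notes above can be read with a few lines of Python. This is a minimal sketch, assuming UTF-8 files with exactly one `word<TAB>frequency` pair per line; the function name `read_vocab_frequencies` and the sample words are illustrative and not part of the released scripts.

```python
from collections import Counter
from typing import Iterable


def read_vocab_frequencies(lines: Iterable[str]) -> Counter:
    """Parse 'word<TAB>frequency' lines into a Counter (assumed format)."""
    freqs = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Split on the last tab so the frequency is always the final field.
        word, sep, count = line.rpartition("\t")
        if not sep or not word:
            continue  # skip blank or malformed lines
        freqs[word] = int(count)
    return freqs


# Illustrative in-memory sample; real input would be an opened vocabulary file.
sample = ["भारत\t1200\n", "है\t5400\n", "में\t4300\n", "\n"]
freqs = read_vocab_frequencies(sample)
print(freqs.most_common(2))
```

For a real file, pass an open handle instead of the sample list, for example `open(path, encoding="utf-8")`, where `path` is wherever you saved the downloaded vocabulary file.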

Pre-requisites

To replicate the results reported in the paper, training and evaluation scripts are provided.

To run these scripts, the following tools/packages are required:

For Python packages to install, see requirements.txt

Word Embeddings

DOWNLOAD

Version 1

language pa hi bn or gu mr kn te ml ta
vectors link link link link link link link link link link
model link link link link link link link link link link

Training word embeddings

$FASTTEXT_HOME/build/fasttext skipgram \
	-epoch 10 -thread 30 -ws 5 -neg 10 -minCount 5 -dim 300 \
	-input $mono_path \
	-output $output_emb_prefix 
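Once training finishes, the vectors can be loaded without any extra dependency. The sketch below assumes the plain-text `.vec` file that fastText writes next to the binary model: a header line with the vocabulary size and dimension, followed by one space-separated `word v1 v2 ...` line per word. The function name `load_vectors` and its `max_vocab` parameter are illustrative choices, not an API from this repository.

```python
import io
from typing import Dict, List, Optional, TextIO


def load_vectors(handle: TextIO, max_vocab: Optional[int] = None) -> Dict[str, List[float]]:
    """Read a fastText-style text vector file into a dict (assumed format)."""
    header = handle.readline().split()
    dim = int(header[1])  # header is "<vocab_size> <dim>"
    vectors: Dict[str, List[float]] = {}
    for line in handle:
        if max_vocab is not None and len(vectors) >= max_vocab:
            break
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip lines that do not match the declared dimension
        vectors[word] = [float(v) for v in values]
    return vectors


# Tiny in-memory stand-in for a real .vec file.
demo = io.StringIO("2 3\nnamaste 0.1 0.2 0.3 \nduniya 0.4 0.5 0.6 \n")
vecs = load_vectors(demo)
print(sorted(vecs))
```

Because fastText writes words in descending frequency order, a `max_vocab` cut-off keeps the most frequent words, which is often enough for quick experiments.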

Evaluation on word similarity task

Evaluate on the IIIT-H Word Similarity Database: DOWNLOAD

The link above points to a cleaned version of the database found HERE.

Evaluation Command

python scripts/word_similarity/wordsim.py \
	<embedding_file_path> \
	<word_sim_db_path> \
	<max_vocab>
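For readers who want to see what such an evaluation typically computes, here is a small self-contained sketch. It assumes the usual protocol for word-similarity benchmarks: score each word pair by the cosine similarity of its embeddings, then report the Spearman rank correlation against the human ratings. This is an assumption about the metric and not a description of what `wordsim.py` does internally; all function names and the toy data are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var = sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var) if var else 0.0


def evaluate_similarity(vectors, pairs):
    """Return (spearman, pairs_used); pairs with unknown words are skipped."""
    model, gold = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:
            model.append(cosine(vectors[w1], vectors[w2]))
            gold.append(score)
    return spearman(model, gold), len(gold)


# Toy embeddings and ratings, purely illustrative.
toy_vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.5, 0.5]}
toy_pairs = [("a", "b", 9.0), ("a", "d", 5.0), ("a", "c", 1.0), ("a", "zzz", 3.0)]
rho, used = evaluate_similarity(toy_vectors, toy_pairs)
print(rho, used)
```

Reporting the number of pairs actually scored alongside the correlation makes it easier to compare embeddings with different vocabulary coverage.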

Evaluation on word analogy task

Evaluate on the Facebook word analogy dataset.

Evaluation Command

First, add the MUSE root directory to PYTHONPATH, then run the evaluation script:

export PYTHONPATH=$PYTHONPATH:$MUSE_PATH
python scripts/word_analogy/word_analogy.py \
    --analogy_fname <analogy_fname> \
    --embeddings_path <embedding_file_path> \
    --lang 'hi' \
    --emb_dim 300 \
    --cuda
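As background, a word analogy query of the form "a is to b as c is to ?" is commonly answered with vector arithmetic (often called 3CosAdd): find the word whose embedding is closest to `b - a + c`, excluding the three query words. The sketch below illustrates that general technique on toy vectors. It is an assumption about the method, not a reproduction of `word_analogy.py` or of the MUSE code, and the names `unit` and `solve_analogy` are hypothetical.

```python
import math


def unit(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes a : b :: c : d, or None."""
    if any(w not in vectors for w in (a, b, c)):
        return None
    va, vb, vc = (unit(vectors[w]) for w in (a, b, c))
    target = unit([y - x + z for x, y, z in zip(va, vb, vc)])
    best_word, best_score = None, -2.0
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # the query words themselves are not valid answers
        score = sum(p * q for p, q in zip(target, unit(vec)))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Toy 3-dimensional embeddings, purely illustrative.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, 1.0],
}
answer = solve_analogy(toy, "man", "woman", "king")
print(answer)
```

Accuracy on an analogy dataset is then the fraction of questions for which the returned word matches the expected one.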

IndicNLP News Article Classification Dataset

We used the IndicNLP text corpora to create classification datasets comprising news articles and their categories for 9 languages. The dataset is balanced across classes. The following table contains the statistics of our dataset:

Language Classes Articles per Class
Bengali entertainment, sports 7K
Gujarati business, entertainment, sports 680
Kannada entertainment, lifestyle, sports 10K
Malayalam business, entertainment, sports, technology 1.5K
Marathi entertainment, lifestyle, sports 1.5K
Oriya business, crime, entertainment, sports 7.5K
Punjabi business, entertainment, sports, politics 780
Tamil entertainment, politics, sport 3.9K
Telugu entertainment, business, sports 8K

DOWNLOAD

Evaluation Command

python3 scripts/txtcls.py --emb_path <path> --data_dir <path> --lang <lang code>

Publicly available Classification Datasets

We also evaluated the IndicNLP embeddings on many publicly available classification datasets.

  • ACTSA Corpus: Sentiment analysis corpus for Telugu sentences.
  • BBC News Articles: Text classification corpus for Hindi documents extracted from BBC news website.
  • IIT Patna Product Reviews: Sentiment analysis corpus for product reviews posted in Hindi.
  • INLTK Headlines Corpus: Obtained from the inltk project. The corpus is a collection of headlines tagged with their news category. Available for languages: gu, ml, mr and ta.
  • IIT Patna Movie Reviews: Sentiment analysis corpus for movie reviews posted in Hindi.
  • Bengali News Articles: Contains Bengali news articles tagged with their news category.

We have created standard train, validation and test splits for the above-mentioned datasets. You can download them to evaluate your embeddings.

DOWNLOAD

Evaluation Command

To evaluate your embeddings on the above mentioned datasets, first download them and then run the following command:

python3 scripts/txtcls.py --emb_path <path> --data_dir <path> --lang <lang code>

License

Each of these public datasets is available under its original license.

Morphanalyzers

IndicNLP Morphanalyzers are unsupervised morphological analyzers trained with Morfessor.

DOWNLOAD

Version 1

pa hi bn or gu mr kn te ml ta

Training Command

## extract vocabulary from embeddings file
zcat $embedding_vectors_path |  \
    tail -n +2 | \
    cut -f 1 -d ' '  > $vocab_file_path

## train morfessor 
morfessor-train -d ones \
        -S $model_file_path \
        --logfile  $log_file_path \
        --traindata-list $vocab_file_path \
        --max-epoch 10 
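If `zcat`, `tail` and `cut` are not available (for example on Windows), the vocabulary extraction step can be approximated in Python. This sketch assumes the same input as the shell pipeline above: a gzip-compressed text vector file whose first line is a header and whose remaining lines start with the word followed by a space. The function name `extract_vocab` is illustrative, and the demo builds its input in memory instead of reading a real file.

```python
import gzip
import io
from typing import TextIO


def extract_vocab(src: TextIO, dst: TextIO) -> int:
    """Write the first space-delimited field of every non-header line."""
    next(src, None)  # skip the header line, like `tail -n +2`
    count = 0
    for line in src:
        word = line.rstrip("\n").split(" ", 1)[0]  # like `cut -f 1 -d ' '`
        dst.write(word + "\n")
        count += 1
    return count


# In-memory stand-in for a real gzip-compressed vector file.
raw = "2 3\nnamaste 0.1 0.2 0.3\nduniya 0.4 0.5 0.6\n".encode("utf-8")
compressed = io.BytesIO(gzip.compress(raw))

out = io.StringIO()
with gzip.open(compressed, "rt", encoding="utf-8") as src:
    n_words = extract_vocab(src, out)
print(n_words)
```

With real files, open the vectors with `gzip.open(path, "rt", encoding="utf-8")` and the output with `open(vocab_path, "w", encoding="utf-8")`; the resulting file plays the role of `$vocab_file_path` in the training command.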

Citing

If you are using any of the resources, please cite the following article:

@article{kunchukuttan2020indicnlpcorpus,
    title={AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages},
    author={Anoop Kunchukuttan and Divyanshu Kakwani and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
    year={2020},
    journal={arXiv preprint arXiv:2005.00085},
}

We would like to hear from you if:

  • You are using our resources. Please let us know how you are putting these resources to use.
  • You have any feedback on these resources.

License

IndicNLP Corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Contributors

  • Anoop Kunchukuttan
  • Divyanshu Kakwani
  • Satish Golla
  • Gokul NC
  • Avik Bhattacharyya
  • Mitesh Khapra
  • Pratyush Kumar

This work is the outcome of a volunteer effort as part of the AI4Bharat initiative.

Contact
