AI4Bharat-IndicNLP Dataset
The AI4Bharat-IndicNLP dataset is an ongoing effort to create a collection of large-scale, general-domain corpora for Indian languages. Currently, it contains 2.7 billion words for 10 Indian languages from two language families. We share pre-trained word embeddings trained on these corpora. We create news article category classification datasets for 9 languages to evaluate the embeddings. We evaluate the IndicNLP embeddings on multiple evaluation tasks.
Details of the corpus and other resources are available HERE. We showcased the AI4Bharat-IndicNLP dataset at RepL4NLP 2020 (co-located with ACL 2020) as a non-archival extended abstract. You can watch the talk here: VIDEO.
You can use the IndicNLP corpus and embeddings for multiple Indian language tasks. A comprehensive list of Indian language NLP resources can be found in the IndicNLP Catalog. To process Indian language text, you can use the Indic NLP Library.
Table of contents
- Text Corpora
- Word Embeddings
- IndicNLP News Article Classification Dataset
- Publicly available Classification Datasets
- Morphanalyzers
- Citing
- License
- Contributors
- Contact
Text Corpora
The text corpora cover 12 languages:
Language | # News Articles* | Sentences | Tokens | Link |
---|---|---|---|---|
as | 0.60M | 1.39M | 32.6M | link |
bn | 3.83M | 39.9M | 836M | link |
en | 3.49M | 54.3M | 1.22B | link |
gu | 2.63M | 41.1M | 719M | link |
hi | 4.95M | 63.1M | 1.86B | link |
kn | 3.76M | 53.3M | 713M | link |
ml | 4.75M | 50.2M | 721M | link |
mr | 2.31M | 34.0M | 551M | link |
or | 0.69M | 6.94M | 107M | link |
pa | 2.64M | 29.2M | 773M | link |
ta | 4.41M | 31.5M | 582M | link |
te | 3.98M | 47.9M | 674M | link |
Note
- The vocabulary frequency files contain the frequency of every unique token in the corpus. Each line contains one word and its frequency, separated by a tab.
- For convenience, the corpus is already tokenized using the IndicNLP tokenizer. You can use the IndicNLP detokenizer in case you want a detokenized version.
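The vocabulary frequency format described above can be read with a few lines of Python. This is a minimal sketch, assuming UTF-8 text with exactly one `word<TAB>frequency` pair per line; the function name and the sample lines are illustrative and not part of the released scripts.

```python
from collections import Counter
from typing import Iterable


def read_vocab_frequencies(lines: Iterable[str]) -> Counter:
    """Parse 'word<TAB>frequency' lines into a Counter."""
    freqs: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the last tab so the frequency is always the final field.
        word, count = line.rsplit("\t", 1)
        freqs[word] = int(count)
    return freqs


# Illustrative in-memory sample; real input would be an opened vocabulary file.
sample = ["भारत\t120\n", "है\t300\n"]
print(read_vocab_frequencies(sample).most_common(1))
```

To process a downloaded file, pass an open file handle (for example `open(path, encoding="utf-8")`) instead of the sample list.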
Pre-requisites
To replicate the results reported in the paper, training and evaluation scripts are provided.
To run these scripts, the following tools/packages are required:
For Python packages to install, see requirements.txt
Word Embeddings
DOWNLOAD
Version 1
language | pa | hi | bn | or | gu | mr | kn | te | ml | ta |
---|---|---|---|---|---|---|---|---|---|---|
vectors | link | link | link | link | link | link | link | link | link | link |
model | link | link | link | link | link | link | link | link | link | link |
Training word embeddings
$FASTTEXT_HOME/build/fasttext skipgram \
-epoch 10 -thread 30 -ws 5 -neg 10 -minCount 5 -dim 300 \
-input $mono_path \
-output $output_emb_prefix
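fastText writes a plain-text vectors file next to the binary model: a header line with the vocabulary size and dimension, followed by one line per word with its space-separated vector components. The sketch below reads that format using only the standard library. It assumes the file has already been decompressed and is UTF-8 encoded; `read_vec` is a hypothetical helper, not part of this repository.

```python
from typing import Dict, Iterable, List, Optional


def read_vec(lines: Iterable[str],
             max_vocab: Optional[int] = None) -> Dict[str, List[float]]:
    """Read fastText text-format vectors into a {word: vector} dict."""
    it = iter(lines)
    header = next(it).split()
    dim = int(header[1])  # header is "<vocab size> <dimension>"
    vectors: Dict[str, List[float]] = {}
    for line in it:
        if max_vocab is not None and len(vectors) >= max_vocab:
            break
        # Lines may end with a trailing space before the newline.
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
    return vectors


# Illustrative two-word file with 3-dimensional vectors.
sample = ["2 3\n", "राजा 0.1 0.2 0.3 \n", "रानी 0.4 0.5 0.6 \n"]
print(len(read_vec(sample)))
```

Because words are written in descending frequency order by fastText, `max_vocab` keeps the most frequent entries when memory is limited.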
Evaluation on word similarity task
Evaluate on the IIIT-H Word Similarity Database: DOWNLOAD
The link above points to a cleaned version of the database found HERE.
Evaluation Command
python scripts/word_similarity/wordsim.py \
<embedding_file_path> \
<word_sim_db_path> \
<max_vocab>
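Word similarity benchmarks are commonly scored by computing the cosine similarity of each word pair and then the Spearman rank correlation between those scores and the human ratings. The sketch below illustrates that general procedure with the standard library only. Whether `wordsim.py` follows exactly these steps (for instance, how it treats out-of-vocabulary pairs) is an assumption; all function names here are illustrative.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of the ranks (needs at least two distinct values)."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


def evaluate_similarity(vectors: Dict[str, Sequence[float]],
                        pairs: Sequence[Tuple[str, str, float]]) -> Tuple[float, int]:
    """Return (Spearman correlation, number of pairs with both words in vocab)."""
    model, gold = [], []
    for w1, w2, score in pairs:
        if w1 in vectors and w2 in vectors:  # skip out-of-vocabulary pairs
            model.append(cosine(vectors[w1], vectors[w2]))
            gold.append(score)
    return spearman(model, gold), len(gold)
```

Reporting the number of covered pairs alongside the correlation makes results comparable across embeddings with different vocabularies.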
Evaluation on word analogy task
Evaluate on the Facebook word analogy dataset.
Evaluation Command
First, add the MUSE root directory to PYTHONPATH:
export PYTHONPATH=$PYTHONPATH:$MUSE_PATH
python scripts/word_analogy/word_analogy.py \
--analogy_fname <analogy_fname> \
--embeddings_path <embedding_file_path> \
--lang 'hi' \
--emb_dim 300 \
--cuda
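For orientation, an analogy question "a is to b as c is to ?" is usually answered by finding the word whose vector is closest to `b - a + c`, excluding the three query words (often called 3CosAdd). The following is a minimal standard-library sketch of that idea, not the MUSE implementation that the script above relies on; the toy vectors and function names are illustrative.

```python
import math
from typing import Dict, List, Optional, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def solve_analogy(vectors: Dict[str, Sequence[float]],
                  a: str, b: str, c: str) -> Optional[str]:
    """Return the word d maximizing cos(d, b - a + c), excluding a, b and c."""
    va, vb, vc = (normalize(vectors[w]) for w in (a, b, c))
    target = normalize([y - x + z for x, y, z in zip(va, vb, vc)])
    best_word, best_score = None, -math.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue  # query words are not valid answers
        score = sum(p * q for p, q in zip(normalize(vec), target))
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Toy 3-dimensional vectors, for illustration only.
toy = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 0.0, 1.0],
}
print(solve_analogy(toy, "man", "woman", "king"))
```

Accuracy on an analogy dataset is then the fraction of questions for which the returned word matches the expected answer.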
IndicNLP News Article Classification Dataset
We used the IndicNLP text corpora to create classification datasets comprising news articles and their categories for 9 languages. The dataset is balanced across classes. The following table contains the statistics of our dataset:
Language | Classes | Articles per Class |
---|---|---|
Bengali | entertainment, sports | 7K |
Gujarati | business, entertainment, sports | 680 |
Kannada | entertainment, lifestyle, sports | 10K |
Malayalam | business, entertainment, sports, technology | 1.5K |
Marathi | entertainment, lifestyle, sports | 1.5K |
Oriya | business, crime, entertainment, sports | 7.5K |
Punjabi | business, entertainment, sports, politics | 780 |
Tamil | entertainment, politics, sports | 3.9K |
Telugu | entertainment, business, sports | 8K |
Evaluation Command
python3 scripts/txtcls.py --emb_path <path> --data_dir <path> --lang <lang code>
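One simple way to use word embeddings for this kind of task is to average the vectors of a document's tokens and classify the result with a nearest-neighbour vote. The sketch below shows that approach with the standard library. It is an illustration under stated assumptions: the internals of `txtcls.py` are not described here, and the function names, the value of `k`, and the toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def document_vector(tokens: Sequence[str],
                    vectors: Dict[str, Sequence[float]],
                    dim: int) -> List[float]:
    """Average the vectors of in-vocabulary tokens (zero vector if none)."""
    total = [0.0] * dim
    count = 0
    for token in tokens:
        vec = vectors.get(token)
        if vec is not None:
            total = [a + b for a, b in zip(total, vec)]
            count += 1
    return [x / count for x in total] if count else total


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(query: Sequence[float],
                train_vecs: Sequence[Sequence[float]],
                train_labels: Sequence[str],
                k: int = 3) -> str:
    """Majority label among the k training documents most similar to query."""
    scored = sorted(
        ((cosine(query, vec), label) for vec, label in zip(train_vecs, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

A non-parametric classifier like this keeps the comparison focused on the embeddings, since there are no classifier weights to tune.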
Publicly available Classification Datasets
We also evaluated the IndicNLP embeddings on many publicly available classification datasets.
- ACTSA Corpus: Sentiment analysis corpus for Telugu sentences.
- BBC News Articles: Text classification corpus for Hindi documents extracted from BBC news website.
- IIT Patna Product Reviews: Sentiment analysis corpus for product reviews posted in Hindi.
- INLTK Headlines Corpus: Obtained from the inltk project. The corpus is a collection of headlines tagged with their news category. Available for languages: gu, ml, mr and ta.
- IIT Patna Movie Reviews: Sentiment analysis corpus for movie reviews posted in Hindi.
- Bengali News Articles: Contains Bengali news articles tagged with their news category.
We have created standard train, validation and test splits for the datasets listed above. You can download them to evaluate your embeddings.
Evaluation Command
To evaluate your embeddings on the datasets listed above, first download them and then run the following command:
python3 scripts/txtcls.py --emb_path <path> --data_dir <path> --lang <lang code>
License
Each of these datasets is available under the original license of the corresponding public dataset.
Morphanalyzers
IndicNLP Morphanalyzers are unsupervised morphological analyzers trained with Morfessor.
DOWNLOAD
Version 1
pa | hi | bn | or | gu | mr | kn | te | ml | ta |
---|---|---|---|---|---|---|---|---|---|
Training Command
## extract vocabulary from embeddings file
zcat $embedding_vectors_path | \
tail -n +2 | \
cut -f 1 -d ' ' > $vocab_file_path
## train morfessor
morfessor-train -d ones \
-S $model_file_path \
--logfile $log_file_path \
--traindata-list $vocab_file_path \
--max-epoch 10
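The vocabulary extraction step above skips the header line of the gzipped vectors file and keeps the first space-separated field of every remaining line. For environments without `zcat`, `tail` and `cut`, the same step can be sketched in Python; `extract_vocab` and its parameter names are hypothetical, and UTF-8 encoding is assumed.

```python
import gzip


def extract_vocab(vectors_gz_path: str, vocab_path: str) -> int:
    """Write one word per line to vocab_path; return the number of words."""
    count = 0
    with gzip.open(vectors_gz_path, "rt", encoding="utf-8") as src, \
            open(vocab_path, "w", encoding="utf-8") as dst:
        next(src, None)  # skip the "<vocab size> <dimension>" header (tail -n +2)
        for line in src:
            # Keep only the first space-separated field (cut -f 1 -d ' ').
            word = line.split(" ", 1)[0].rstrip("\n")
            dst.write(word + "\n")
            count += 1
    return count
```

The resulting file has one word per line and can be passed to `morfessor-train` via `--traindata-list`, as in the command above.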
Citing
If you are using any of the resources, please cite the following article:
@article{kunchukuttan2020indicnlpcorpus,
title={AI4Bharat-IndicNLP Corpus: Monolingual Corpora and Word Embeddings for Indic Languages},
author={Anoop Kunchukuttan and Divyanshu Kakwani and Satish Golla and Gokul N.C. and Avik Bhattacharyya and Mitesh M. Khapra and Pratyush Kumar},
year={2020},
journal={arXiv preprint arXiv:2005.00085},
}
We would like to hear from you if:
- You are using our resources. Please let us know how you are putting these resources to use.
- You have any feedback on these resources.
License
IndicNLP Corpus is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Contributors
- Anoop Kunchukuttan
- Divyanshu Kakwani
- Satish Golla
- Gokul NC
- Avik Bhattacharyya
- Mitesh Khapra
- Pratyush Kumar
This work is the outcome of a volunteer effort as part of the AI4Bharat initiative.
Contact
- Anoop Kunchukuttan
- Mitesh Khapra
- Pratyush Kumar