• Stars: 811
• Rank: 55,794 (Top 2%)
• Language: Python
• License: MIT License
• Created over 5 years ago
• Updated 8 months ago


Repository Details

The Natural Language Toolkit for Indic Languages aims to provide out-of-the-box support for the NLP tasks that an application developer might need.

Natural Language Toolkit for Indic Languages (iNLTK)


iNLTK aims to provide out-of-the-box support for the NLP tasks that an application developer might need for Indic languages. The paper describing the library was accepted at the NLP-OSS workshop at EMNLP 2020; its URL is given in the Citation section.

Documentation

Check out the detailed docs, including installation instructions, at https://inltk.readthedocs.io

Supported languages

Native languages

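| Language | Code |
|---|---|
| Hindi | hi |
| Punjabi | pa |
| Gujarati | gu |
| Kannada | kn |
| Malayalam | ml |
| Oriya | or |
| Marathi | mr |
| Bengali | bn |
| Tamil | ta |
| Urdu | ur |
| Nepali | ne |
| Sanskrit | sa |
| English | en |
| Telugu | te |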

Code Mixed languages

| Language | Script | Code |
|---|---|---|
| Hinglish (Hindi+English) | Latin | hi-en |
| Tanglish (Tamil+English) | Latin | ta-en |
| Manglish (Malayalam+English) | Latin | ml-en |
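
The codes in the two tables above are what select a language when calling the library. As a minimal sketch, the lookup below mirrors those tables and checks a code before it is used. The names `SUPPORTED_LANGUAGES`, `describe_language`, and `tokenize_text` are hypothetical helpers, not part of iNLTK. The `setup` and `tokenize` calls inside `tokenize_text` are an assumption based on the function names in the iNLTK documentation, and they need the package installed.

```python
# Language codes copied from the tables above.
NATIVE_LANGUAGES = {
    "hi": "Hindi", "pa": "Punjabi", "gu": "Gujarati", "kn": "Kannada",
    "ml": "Malayalam", "or": "Oriya", "mr": "Marathi", "bn": "Bengali",
    "ta": "Tamil", "ur": "Urdu", "ne": "Nepali", "sa": "Sanskrit",
    "en": "English", "te": "Telugu",
}
CODE_MIXED_LANGUAGES = {
    "hi-en": "Hinglish (Hindi+English)",
    "ta-en": "Tanglish (Tamil+English)",
    "ml-en": "Manglish (Malayalam+English)",
}
SUPPORTED_LANGUAGES = {**NATIVE_LANGUAGES, **CODE_MIXED_LANGUAGES}


def describe_language(code: str) -> str:
    """Return the language name for a code, or raise ValueError if unsupported."""
    try:
        return SUPPORTED_LANGUAGES[code.strip().lower()]
    except KeyError:
        raise ValueError(f"Unsupported language code: {code!r}") from None


def tokenize_text(text: str, code: str):
    """Validate the code, then tokenize with iNLTK (assumed API, see above)."""
    describe_language(code)  # fail early on an unknown code
    # Imported lazily so the lookup above works without iNLTK installed.
    from inltk.inltk import setup, tokenize
    setup(code)  # assumed to prepare the model for this language
    return tokenize(text, code)


if __name__ == "__main__":
    print(describe_language("hi"))
    print(describe_language("ta-en"))
```

Validating the code first keeps a typo such as `"hn"` from reaching the library at all.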

Repositories containing models used in iNLTK

Where a cell lists several values separated by semicolons, they correspond, in order, to the datasets listed in the matching dataset column of that row.

| Language | Repository | Dataset used for Language modeling | Perplexity of ULMFiT LM (on validation set) | Perplexity of TransformerXL LM (on validation set) | Dataset used for Classification | Classification: Test set Accuracy | Classification: Test set MCC | Classification: Notebook for Reproducibility | ULMFiT Embeddings visualization | TransformerXL Embeddings visualization |
|---|---|---|---|---|---|---|---|---|---|---|
| Hindi | NLP for Hindi | Hindi Wikipedia Articles - 172k; Hindi Wikipedia Articles - 55k | 34.06; 35.87 | 26.09; 34.78 | BBC News Articles; IIT Patna Movie Reviews; IIT Patna Product Reviews | 78.75; 57.74; 75.71 | 0.71; 0.37; 0.59 | Notebook; Notebook; Notebook | Hindi Embeddings projection | Hindi Embeddings projection |
| Bengali | NLP for Bengali | Bengali Wikipedia Articles | 41.2 | 39.3 | Bengali News Articles (Soham Articles) | 90.71 | 0.87 | Notebook | Bengali Embeddings projection | Bengali Embeddings projection |
| Gujarati | NLP for Gujarati | Gujarati Wikipedia Articles | 34.12 | 28.12 | iNLTK Headlines Corpus - Gujarati | 91.05 | 0.86 | Notebook | Gujarati Embeddings projection | Gujarati Embeddings projection |
| Malayalam | NLP for Malayalam | Malayalam Wikipedia Articles | 26.39 | 25.79 | iNLTK Headlines Corpus - Malayalam | 95.56 | 0.93 | Notebook | Malayalam Embeddings projection | Malayalam Embeddings projection |
| Marathi | NLP for Marathi | Marathi Wikipedia Articles | 18 | 17.42 | iNLTK Headlines Corpus - Marathi | 92.40 | 0.85 | Notebook | Marathi Embeddings projection | Marathi Embeddings projection |
| Tamil | NLP for Tamil | Tamil Wikipedia Articles | 19.80 | 17.22 | iNLTK Headlines Corpus - Tamil | 95.22 | 0.92 | Notebook | Tamil Embeddings projection | Tamil Embeddings projection |
| Punjabi | NLP for Punjabi | Punjabi Wikipedia Articles | 24.40 | 14.03 | IndicNLP News Article Classification Dataset - Punjabi | 97.12 | 0.96 | Notebook | Punjabi Embeddings projection | Punjabi Embeddings projection |
| Kannada | NLP for Kannada | Kannada Wikipedia Articles | 70.10 | 61.97 | IndicNLP News Article Classification Dataset - Kannada | 98.87 | 0.98 | Notebook | Kannada Embeddings projection | Kannada Embeddings projection |
| Oriya | NLP for Oriya | Oriya Wikipedia Articles | 26.57 | 26.81 | IndicNLP News Article Classification Dataset - Oriya | 98.83 | 0.98 | Notebook | Oriya Embeddings Projection | Oriya Embeddings Projection |
| Sanskrit | NLP for Sanskrit | Sanskrit Wikipedia Articles | ~6 | ~3 | Sanskrit Shlokas Dataset | 84.3 (valid set) | | | Sanskrit Embeddings projection | Sanskrit Embeddings projection |
| Nepali | NLP for Nepali | Nepali Wikipedia Articles | 31.5 | 29.3 | Nepali News Dataset | 98.5 (valid set) | | | Nepali Embeddings projection | Nepali Embeddings projection |
| Urdu | NLP for Urdu | Urdu Wikipedia Articles | 13.19 | 12.55 | Urdu News Dataset | 95.28 (valid set) | | | Urdu Embeddings projection | Urdu Embeddings projection |
| Telugu | NLP for Telugu | Telugu Wikipedia Articles | 27.47 | 29.44 | Telugu News Dataset; Telugu News Andhra Jyoti | 95.4; 92.09 | | Notebook; Notebook | Telugu Embeddings projection | Telugu Embeddings projection |
| Tanglish | NLP for Tanglish | Synthetic Tanglish Dataset | 37.50 | - | Dravidian Codemix HASOC @ FIRE 2020; Dravidian Codemix Sentiment Analysis @ FIRE 2020 | F1 Score: 0.88; F1 Score: 0.62 | - | Notebook; Notebook | Tanglish Embeddings Projection | - |
| Manglish | NLP for Manglish | Synthetic Manglish Dataset | 45.84 | - | Dravidian Codemix HASOC @ FIRE 2020; Dravidian Codemix Sentiment Analysis @ FIRE 2020 | F1 Score: 0.74; F1 Score: 0.69 | - | Notebook; Notebook | Manglish Embeddings Projection | - |
| Hinglish | NLP for Hinglish | Synthetic Hinglish Dataset | 86.48 | - | - | - | - | - | Hinglish Embeddings Projection | - |

Note: English model has been directly taken from fast.ai
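
The table above reports language-model perplexity and, for the classifiers, accuracy and the Matthews correlation coefficient (MCC). For readers unfamiliar with these metrics, here is a minimal sketch of their standard definitions. It is not the evaluation code from the repositories, and the function names are illustrative. Perplexity is taken as the exponential of the mean negative log-likelihood per token, and `mcc` is the multi-class generalisation computed from label counts, which reduces to the usual formula for two classes.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood the model gave to each true token."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcc(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Multi-class Matthews correlation coefficient, in the range [-1, 1]."""
    s = len(y_true)                                    # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))    # correctly classified
    true_counts = Counter(y_true)                      # t_k: times class k occurs
    pred_counts = Counter(y_pred)                      # p_k: times class k predicted
    labels = set(true_counts) | set(pred_counts)
    numerator = c * s - sum(true_counts[k] * pred_counts[k] for k in labels)
    spread_true = s * s - sum(n * n for n in true_counts.values())
    spread_pred = s * s - sum(n * n for n in pred_counts.values())
    denominator = math.sqrt(spread_true * spread_pred)
    # Convention: return 0 when all labels or all predictions are one class.
    return numerator / denominator if denominator else 0.0


if __name__ == "__main__":
    print(perplexity([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 choices
    print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```

Unlike accuracy, MCC stays near zero for a classifier that always predicts the majority class, which is why it is useful alongside accuracy on imbalanced datasets.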

Effect of using Transfer Learning + Paraphrases from iNLTK

| Language | Repository | Dataset used for Classification | Results on using complete training set | Percentage Decrease in Training set size | Results on using reduced training set without Paraphrases | Results on using reduced training set with Paraphrases |
|---|---|---|---|---|---|---|
| Hindi | NLP for Hindi | IIT Patna Movie Reviews | Accuracy: 57.74, MCC: 37.23 | 80% (2480 -> 496) | Accuracy: 47.74, MCC: 20.50 | Accuracy: 56.13, MCC: 34.39 |
| Bengali | NLP for Bengali | Bengali News Articles (Soham Articles) | Accuracy: 90.71, MCC: 87.92 | 99% (11284 -> 112) | Accuracy: 69.88, MCC: 61.56 | Accuracy: 74.06, MCC: 65.08 |
| Gujarati | NLP for Gujarati | iNLTK Headlines Corpus - Gujarati | Accuracy: 91.05, MCC: 86.09 | 90% (5269 -> 526) | Accuracy: 80.88, MCC: 70.18 | Accuracy: 81.03, MCC: 70.44 |
| Malayalam | NLP for Malayalam | iNLTK Headlines Corpus - Malayalam | Accuracy: 95.56, MCC: 93.29 | 90% (5036 -> 503) | Accuracy: 82.38, MCC: 73.47 | Accuracy: 84.29, MCC: 76.36 |
| Marathi | NLP for Marathi | iNLTK Headlines Corpus - Marathi | Accuracy: 92.40, MCC: 85.23 | 95% (9672 -> 483) | Accuracy: 84.13, MCC: 68.59 | Accuracy: 84.55, MCC: 69.11 |
| Tamil | NLP for Tamil | iNLTK Headlines Corpus - Tamil | Accuracy: 95.22, MCC: 92.70 | 95% (5346 -> 267) | Accuracy: 86.25, MCC: 79.42 | Accuracy: 89.84, MCC: 84.63 |
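
The "Percentage Decrease" column and the effect of paraphrases can be checked with simple arithmetic. The sketch below uses the Hindi row of the table; the helper names are illustrative only.

```python
def percent_decrease(full_size: int, reduced_size: int) -> float:
    """Percentage by which the training set shrank."""
    return 100 * (full_size - reduced_size) / full_size


def percent_of_full(score_reduced: float, score_full: float) -> float:
    """Score on the reduced set as a percentage of the full-set score."""
    return 100 * score_reduced / score_full


if __name__ == "__main__":
    # Hindi, IIT Patna Movie Reviews: 2480 -> 496 training examples.
    print(percent_decrease(2480, 496))
    # Accuracy without paraphrases, relative to the full training set.
    print(round(percent_of_full(47.74, 57.74), 1))
    # Accuracy with paraphrases, relative to the full training set.
    print(round(percent_of_full(56.13, 57.74), 1))
```

For this row, the reduced training set retains about 82.7% of the full-set accuracy without paraphrases and about 97.2% with them.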

For more details on the implementation, or to reproduce the results, check out the respective repositories.
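
To illustrate how paraphrases can enlarge a reduced training set, here is a minimal, library-agnostic sketch. `augment_with_paraphrases`, the `paraphrase` callable, and `toy_paraphrase` are all hypothetical. In practice the callable would wrap whatever paraphrase generator you use, such as a data-augmentation function from iNLTK. This is not the procedure used to produce the results above.

```python
from typing import Callable, Iterable, List, Tuple

Example = Tuple[str, str]  # (text, label)


def augment_with_paraphrases(
    examples: Iterable[Example],
    paraphrase: Callable[[str, int], List[str]],
    per_example: int = 2,
) -> List[Example]:
    """Keep each example and add up to `per_example` paraphrases with the same label."""
    augmented: List[Example] = []
    for text, label in examples:
        augmented.append((text, label))
        seen = {text}
        for variant in paraphrase(text, per_example):
            if variant not in seen:  # skip exact duplicates of the source or each other
                seen.add(variant)
                augmented.append((variant, label))
    return augmented


def toy_paraphrase(text: str, n: int) -> List[str]:
    """Stand-in generator for demonstration only; not a real paraphraser."""
    return [f"{text} (variant {i + 1})" for i in range(n)]


if __name__ == "__main__":
    train = [("good movie", "positive"), ("boring plot", "negative")]
    for row in augment_with_paraphrases(train, toy_paraphrase):
        print(row)
```

Passing the generator in as a callable keeps the augmentation loop independent of any particular model, so it can be tested without downloading one.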

Contributing

Add a new language support

If you would like to add support for a language of your choice to iNLTK, please start by checking for an existing issue or raising a new one here

Please check out the steps I mentioned here for Telugu to begin with. They should be nearly the same for other languages.

Improving models/using models for your own research

If you would like to take iNLTK's models and refine them with your own dataset, or build your own custom models on top of them, please check out the repository for your language in the table above. Those repositories contain links to the datasets, pretrained models, and classifiers, along with all the code.

Add new functionality

If you would like a particular piece of functionality in iNLTK, start by checking for an existing issue or raising a new one here

What's next

..and being worked upon

Shout out if you want to help :)

..and NOT being worked upon

Shout out if you want to lead :)

iNLTK's Appreciation

Citation

If you use this library in your research, please consider citing:

@inproceedings{arora-2020-inltk,
    title = "i{NLTK}: Natural Language Toolkit for Indic Languages",
    author = "Arora, Gaurav",
    booktitle = "Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.nlposs-1.10",
    doi = "10.18653/v1/2020.nlposs-1.10",
    pages = "66--71",
    abstract = "We present iNLTK, an open-source NLP library consisting of pre-trained language models and out-of-the-box support for Data Augmentation, Textual Similarity, Sentence Embeddings, Word Embeddings, Tokenization and Text Generation in 13 Indic Languages. By using pre-trained models from iNLTK for text classification on publicly available datasets, we significantly outperform previously reported results. On these datasets, we also show that by using pre-trained models and data augmentation from iNLTK, we can achieve more than 95{\%} of the previous best performance by using less than 10{\%} of the training data. iNLTK is already being widely used by the community and has 40,000+ downloads, 600+ stars and 100+ forks on GitHub. The library is available at https://github.com/goru001/inltk.",
}

More Repositories

1. code-with-ai (Jupyter Notebook, 146 stars): Interface for people to use my model which predicts which techniques one should use to solve a competitive programming problem to get an AC
2. nlp-for-hindi (Jupyter Notebook, 117 stars): State of the Art Language models and Classifier for Hindi language (spoken in Indian sub-continent)
3. nlp-for-sanskrit (Jupyter Notebook, 72 stars): State of the Art Language models and Classifier for Sanskrit language (ancient indian language)
4. awesome-agriculture (54 stars): List of project ideas/references which can help engineers build technology for agriculture which can eventually help farmers
5. nlp-for-tamil (Jupyter Notebook, 52 stars): State of the Art Language models and Classifier for Tamil language (spoken in India, and few other South Asian countries)
6. nlp-for-malyalam (Jupyter Notebook, 36 stars): State of the Art Language models and Classifier for Malayalam, which is spoken by the Malayali people in the Indian state of Kerala and the union territories of Lakshadweep and Puducherry
7. nlp-for-bengali (Jupyter Notebook, 31 stars): State of the Art Language models and Classifier for Bengali, which is primarily spoken by the Bengalis in South Asia.
8. nlp-for-nepali (Jupyter Notebook, 30 stars): State of the Art Language models and Classifier for Nepali, which is official language of Nepal and one of the official status gained language of India
9. nlp-for-kannada (Jupyter Notebook, 28 stars): State of the Art Language models and Classifier for Kannada, which is spoken predominantly by Kannada people in India, mainly in the state of Karnataka
10. nlp-for-gujarati (Jupyter Notebook, 25 stars): State of the Art Language models and Classifier for Gujarati, which is a language native to the Indian state of Gujarat
11. nlp-for-marathi (Jupyter Notebook, 25 stars): State of the Art Language models and Classifier for Marathi, which is spoken predominantly by Marathi people of Maharashtra, India
12. nlp-for-hinglish (Jupyter Notebook, 22 stars)
13. nlp-for-punjabi (Jupyter Notebook, 14 stars): State of the Art Language models and Classifier for punjabi language (spoken in Indian sub-continent)
14. nlp-for-odia (Jupyter Notebook, 11 stars): State of the Art Language models and Classifier for Odia, which is spoken in the Indian state of Odisha
15. nlp-for-manglish (Jupyter Notebook, 8 stars): State of the Art Language models and Classifier for Code mixed Manglish (Malayalam and English) - spoken in Indian sub-continent.
16. nlp-for-tanglish (Jupyter Notebook, 5 stars): State of the Art Language models and Classifier for Code mixed Tanglish (Tamil and English) - spoken in Indian sub-continent.
17. indian-language-classifier (Jupyter Notebook, 5 stars): Classifier to distinguish which Indian Language a given text contains
18. human-protein-atlas-kaggle-competition (Python, 3 stars): This repository contains my model which was Ranked in Top-17% in Human Protein Atlas Image Classification challenge on Kaggle
19. isl (Jupyter Notebook, 3 stars): Indian sign Language Translation Prototype - Developed during a Hackathon
20. ipl-matches-result-prediction (2 stars): Using Deep Learning to predict IPL Matches Result
21. goru001.github.io (Liquid, 2 stars)
22. whatssms (Python, 1 star): Get important messages from Whatsapp groups as SMS on your mobile
23. stock-market-prediction (1 star): Predicting the opening price of stocks using Deep Learning
24. algorithm-templates (C++, 1 star): Algorithm Templates for direct usage in competitive-programming.