• Stars: 117
• Rank: 300,044 (Top 6%)
• Language: Jupyter Notebook
• License: MIT License
• Created over 5 years ago
• Updated about 4 years ago



NLP for Hindi

This repository contains state-of-the-art language models and a classifier for Hindi, a language spoken in the Indian subcontinent.

The models trained here are used in the Natural Language Toolkit for Indic Languages (iNLTK).

Dataset

Created as part of this project

  1. Hindi Wikipedia Articles - 172k

  2. Hindi Wikipedia Articles - 55k

  3. Hindi Movie Reviews Dataset

  4. Hindi Text Short Summarization Corpus

  5. Hindi Text Short and Large Summarization Corpus

Open Source Datasets

  1. BBC News Articles: sentiment analysis corpus of Hindi documents extracted from the BBC News website.

  2. IIT Patna Product Reviews: sentiment analysis corpus of product reviews posted in Hindi.

  3. IIT Patna Movie Reviews: sentiment analysis corpus of movie reviews posted in Hindi.

Results

Language Model Perplexity (on validation set)

| Architecture/Dataset | Hindi Wikipedia Articles - 172k | Hindi Wikipedia Articles - 55k |
| --- | --- | --- |
| ULMFiT | 34.06 | 35.87 |
| TransformerXL | 26.09 | 34.78 |

Note: Nirant did the previous state-of-the-art work on a Hindi language model and achieved a perplexity of ~46. The scores above aren't directly comparable with his, because his training and validation sets were different and aren't available for reproduction.
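For readers unfamiliar with the metric: perplexity is the exponential of the mean per-token negative log-likelihood (the cross-entropy loss in nats). The sketch below is illustrative only and is not taken from the repository's notebooks; the function names are hypothetical.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from natural-log probabilities assigned to each target token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    # Mean negative log-likelihood per token (cross-entropy in nats).
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

def perplexity_from_loss(mean_cross_entropy):
    """Perplexity from an already-averaged cross-entropy loss (in nats)."""
    return math.exp(mean_cross_entropy)

# A model that spreads probability uniformly over 4 candidate tokens
# is exactly as uncertain as a fair 4-way choice.
uniform_ppl = perplexity([math.log(0.25)] * 10)
```

Because perplexity depends on the tokenization and on the validation text, values are only comparable when both are identical, which is why the note above cautions against comparing with the earlier score.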

Classification Metrics

ULMFiT
| Dataset | Accuracy | MCC | Notebook to Reproduce results |
| --- | --- | --- | --- |
| BBC News Articles | 78.75 | 71.61 | Link |
| IIT Patna Movie Reviews | 57.74 | 37.23 | Link |
| IIT Patna Product Reviews | 75.71 | 59.76 | Link |
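As a reference for how the two metrics are defined, here is a minimal pure-Python sketch of accuracy and the multiclass Matthews correlation coefficient (MCC). It is not the repository's evaluation code, and the function names are hypothetical. It also assumes the table reports both metrics multiplied by 100, since MCC itself lies in [-1, 1].

```python
import math
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def matthews_corrcoef(y_true, y_pred):
    """Multiclass MCC computed from label counts (reduces to binary MCC)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    s = len(y_true)                                   # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))   # correct predictions
    true_counts = Counter(y_true)                     # t_k: times class k occurs
    pred_counts = Counter(y_pred)                     # p_k: times class k is predicted
    classes = set(true_counts) | set(pred_counts)
    cov_tp = c * s - sum(true_counts[k] * pred_counts[k] for k in classes)
    cov_tt = s * s - sum(true_counts[k] ** 2 for k in classes)
    cov_pp = s * s - sum(pred_counts[k] ** 2 for k in classes)
    denom = math.sqrt(cov_tt * cov_pp)
    # By convention, return 0 when either side has only one class.
    return cov_tp / denom if denom else 0.0

# Toy binary example: 2 true positives, 2 true negatives, 1 FP, 1 FN.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 1]
toy_accuracy = accuracy(y_true, y_pred)
toy_mcc = matthews_corrcoef(y_true, y_pred)
```

MCC is useful alongside accuracy here because it accounts for class imbalance: a classifier that always predicts the majority class can score a high accuracy but gets an MCC of 0.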

Visualizations

Word Embeddings

| Architecture | Visualization |
| --- | --- |
| ULMFiT | Embeddings projection |
| TransformerXL | Embeddings projection |

Sentence Embeddings

| Architecture | Visualization |
| --- | --- |
| ULMFiT | Encodings projection |

Results of using Transfer Learning + Data Augmentation from iNLTK

Using the complete training set (with Transfer Learning)

| Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
| --- | --- | --- | --- | --- |
| IIT Patna Movie Reviews | (2480, 310, 310) | 57.74 | 37.23 | Link |

Using 20% of the training set (with Transfer Learning)

| Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
| --- | --- | --- | --- | --- |
| IIT Patna Movie Reviews | (496, 310, 310) | 47.74 | 20.50 | Link |

Using 20% of the training set (with Transfer Learning + Data Augmentation)

| Dataset | Dataset size (train, valid, test) | Accuracy | MCC | Notebook to Reproduce results |
| --- | --- | --- | --- | --- |
| IIT Patna Movie Reviews | (496, 310, 310) | 56.13 | 34.39 | Link |
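The three runs above can be compared with a little arithmetic. The sketch below uses only the numbers reported in this section; the helper name `fraction_recovered` is hypothetical. It checks the size of the 20% subset and computes how much of the score lost by shrinking the training set is regained through data augmentation.

```python
def fraction_recovered(full, reduced, augmented):
    """Share of the (full - reduced) gap that the augmented run wins back."""
    return (augmented - reduced) / (full - reduced)

train_full = 2480
train_subset = train_full * 20 // 100  # 20% of the training examples

# Scores as reported above: full data, 20% data, 20% data + augmentation.
acc_recovered = fraction_recovered(57.74, 47.74, 56.13)
mcc_recovered = fraction_recovered(37.23, 20.50, 34.39)

print(f"subset size: {train_subset}")
print(f"accuracy gap recovered: {acc_recovered:.1%}")
print(f"MCC gap recovered: {mcc_recovered:.1%}")
```

On these figures, augmentation wins back roughly 84% of the accuracy gap and 83% of the MCC gap while using a fifth of the labelled training data.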

Pretrained Models

Language Models

Download pretrained ULMFiT and TransformerXL language models, trained on Hindi Wikipedia Articles - 172k and Hindi Wikipedia Articles - 55k, from here

Tokenizer

The tokenizer was trained in an unsupervised manner using Google's SentencePiece.

Download the trained model and vocabulary from here
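For orientation, a SentencePiece tokenizer of this kind can be trained and loaded roughly as follows. This is a sketch under stated assumptions, not the repository's training script: the corpus path, model prefix, vocabulary size, model type, and character coverage are all placeholders, and the helper names are hypothetical. It requires the `sentencepiece` package, which is imported lazily so the argument helper can be used without it.

```python
def build_trainer_args(corpus_path, model_prefix, vocab_size=30000):
    """Keyword arguments for SentencePieceTrainer.train (values are assumptions)."""
    return {
        "input": corpus_path,          # plain-text file, one sentence per line
        "model_prefix": model_prefix,  # writes <prefix>.model and <prefix>.vocab
        "vocab_size": vocab_size,      # assumed; the repository's value is not stated here
        "model_type": "unigram",       # SentencePiece's default algorithm
        "character_coverage": 1.0,     # assumed: keep every character in the corpus
    }

def train_tokenizer(corpus_path, model_prefix, vocab_size=30000):
    """Train a tokenizer on raw text and return a loaded processor."""
    import sentencepiece as spm  # lazy import: only needed when actually training

    spm.SentencePieceTrainer.train(
        **build_trainer_args(corpus_path, model_prefix, vocab_size)
    )
    return spm.SentencePieceProcessor(model_file=model_prefix + ".model")
```

With a corpus on disk, usage would look like `sp = train_tokenizer("hi_wiki.txt", "hindi_sp")` followed by `sp.encode("some Hindi text", out_type=str)` to get subword pieces; both file names are hypothetical. The downloadable model mentioned above can be loaded directly with `SentencePieceProcessor(model_file=...)` without retraining.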
