NLP - Tutorial
Repository showing how NLP can tackle real problems, including source code, datasets, and the state of the art in NLP.
Data Augmentation
- Data Augmentation in NLP
- Data Augmentation library for Text
- Is your NLP model able to prevent adversarial attacks?
- How does Data Noising Help to Improve your NLP Model?
- Data Augmentation library for Speech Recognition
- Data Augmentation library for Audio
- Unsupervised Data Augmentation
- Adversarial Attacks in Textual Deep Neural Networks
- Back Translation in Text Augmentation by nlpaug
General
Text Preprocessing
| Section | Sub-Section | Description | Story |
|---|---|---|---|
| Tokenization | Subword Tokenization | | Medium |
| Tokenization | Word Tokenization | | Medium Github |
| Tokenization | Sentence Tokenization | | Medium Github |
| Part of Speech | | | Medium Github |
| Lemmatization | | | Medium Github |
| Stemming | | | Medium Github |
| Stop Words | | | Medium Github |
| Phrase Word Recognition | | | |
| Spell Checking | Lexicon-based | Peter Norvig algorithm | Medium Github |
| | Lexicon-based | Symspell | Medium Github |
| Machine Translation | Statistical Machine Translation | | Medium |
| Machine Translation | Attention | | Medium |
| String Matching | Fuzzywuzzy | | Medium Github |
Text Representation
| Section | Sub-Section | Research Lab | Story | Source |
|---|---|---|---|---|
| Traditional Method | Bag-of-words (BoW) | | Medium Github | |
| | Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) | | Medium Github | |
| Character Level | Character Embedding | NYU | Medium Github | Paper |
| Word Level | Negative Sampling and Hierarchical Softmax | | Medium | |
| | Word2Vec, GloVe, fastText | | Medium Github | |
| | Contextualized Word Vectors (CoVe) | Salesforce | Medium Github | Paper Code |
| | Misspelling Oblivious (word) Embeddings | | Medium | Paper |
| | Embeddings from Language Models (ELMo) | AI2 | Medium Github | Paper Code |
| | Contextual String Embeddings | Zalando Research | Medium | Paper Code |
| Sentence Level | Skip-thoughts | | Medium Github | Paper Code |
| | InferSent | | Medium Github | Paper Code |
| | Quick-Thoughts | | Medium | Paper Code |
| | General Purpose Sentence (GenSen) | | Medium | Paper Code |
| | Bidirectional Encoder Representations from Transformers (BERT) | | Medium | Paper(2019) Code |
| | Generative Pre-Training (GPT) | OpenAI | Medium | Paper(2019) Code |
| | Self-Governing Neural Networks (SGNN) | | Medium | Paper |
| | Multi-Task Deep Neural Networks (MT-DNN) | Microsoft | Medium | Paper(2019) |
| | Generative Pre-Training-2 (GPT-2) | OpenAI | Medium | Paper(2019) Code |
| | Universal Language Model Fine-tuning (ULMFiT) | fast.ai | Medium | Paper Code |
| | BERT in Science Domain | | Medium | Paper(2019) Paper(2019) |
| | BERT in Clinical Domain | NYU/PU | Medium | Paper(2019) Paper(2019) |
| | RoBERTa | UW/Facebook | Medium | Paper(2019) Paper |
| | Unified Language Model for NLP and NLU (UNILM) | Microsoft | Medium | Paper(2019) |
| | Cross-lingual Language Model (XLMs) | | Medium | Paper(2019) |
| | Transformer-XL | CMU/Google | Medium | Paper(2019) |
| | XLNet | CMU/Google | Medium | Paper(2019) |
| | CTRL | Salesforce | Medium | Paper(2019) |
| | ALBERT | Google/Toyota | Medium | Paper(2019) |
| | T5 | Google | Medium | Paper(2019) |
| | MultiFiT | | Medium | Paper(2019) |
| | XTREME | | Medium | Paper(2020) |
| | REALM | | Medium | Paper(2020) |
| Document Level | lda2vec | | Medium | Paper |
| | doc2vec | Google | Medium Github | Paper |
NLP Problem
| Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
|---|---|---|---|---|---|
| Named Entity Recognition (NER) | Pattern-based Recognition | | | Medium | |
| | Lexicon-based Recognition | | | Medium | |
| | spaCy Pre-trained NER | | | Medium Github | |
| Optical Character Recognition (OCR) | Printed Text | Google Cloud Vision API | | Medium | Paper |
| | Handwriting | LSTM | | Medium | Paper |
| Text Summarization | Extractive Approach | | | Medium Github | |
| | Abstractive Approach | | | Medium | |
| Emotion Recognition | Audio, Text, Visual | 3 Multimodals for Emotion Recognition | | Medium | |
Acoustic Problem
| Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
|---|---|---|---|---|---|
| Feature Representation | Unsupervised Learning | Introduction to Audio Feature Learning | | Medium | Paper 1 Paper 2 Paper 3 |
| Feature Representation | Unsupervised Learning | Speech2Vec and Sentence Level Embeddings | | Medium | Paper 1 Paper 2 |
| Feature Representation | Unsupervised Learning | Wav2vec | | Medium | Paper |
| Speech-to-text | | Introduction to Speech-to-text | | Medium | |
Text Distance Measurement
| Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
|---|---|---|---|---|---|
| Euclidean Distance, Cosine Similarity and Jaccard Similarity | | | | Medium Github | |
| Edit Distance | Levenshtein Distance | | | Medium Github | |
| Word Moving Distance (WMD) | | | | Medium Github | |
| Supervised Word Moving Distance (S-WMD) | | | | Medium | |
| Manhattan LSTM | | | | Medium | Paper |
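
Three of the measures listed above (Levenshtein distance, Jaccard similarity and cosine similarity) can be sketched in plain Python. This is a minimal illustration, not the code from the linked stories: the function names are hypothetical, and the cosine version assumes simple token-count vectors rather than embeddings.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard_similarity(tokens_a, tokens_b) -> float:
    """Size of the token intersection divided by the size of the token union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty inputs count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def cosine_similarity(tokens_a, tokens_b) -> float:
    """Cosine of the angle between two token-count vectors."""
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(count_a[t] * count_b[t] for t in count_a.keys() & count_b.keys())
    norm_a = math.sqrt(sum(v * v for v in count_a.values()))
    norm_b = math.sqrt(sum(v * v for v in count_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: similarity with an empty input is 0
    return dot / (norm_a * norm_b)
```

For example, `levenshtein("kitten", "sitting")` returns 3, and `jaccard_similarity("a b".split(), "b c".split())` returns one third, since one of the three distinct tokens is shared.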
Model Interpretation
| Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
|---|---|---|---|---|---|
| ELI5, LIME and Skater | | | | Medium Github | |
| SHapley Additive exPlanations (SHAP) | | | | Medium Github | |
| Anchors | | | | Medium Github | |
Graph
| Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
|---|---|---|---|---|---|
| Embeddings | TransE, RESCAL, DistMult, ComplEx, PyTorch BigGraph | | | Medium | RESCAL(2011) TransE(2013) DistMult(2015) ComplEx(2016) PyTorch BigGraph(2019) |
| Embeddings | DeepWalk, node2vec, LINE, GraphSAGE | | | Medium | DeepWalk(2014) node2vec(2015) LINE(2015) GraphSAGE(2018) |
| Embeddings | WLG, GCN, GAT, GIN | | | Medium | WLG(2011) GCN(2017) GAT(2017) GraphSAGE(2018) |
| Embeddings | PinSAGE(2018) | | | Medium | |
| Embeddings | HolE(2015), SimplE(2018) | | | Medium | |
| Embeddings | ContE(2017), ETE(2017) | | | Medium | |
Meta-Learning
| Section | Sub-Section | Description | Story |
|---|---|---|---|
| Introduction | | Matching Nets(2016), MANN(2016), LSTM-based meta-learner(2017), Prototypical Networks(2017), ARC(2017), MAML(2017), MetaNet(2017) | Medium |
| NLP | Dialog Generation | DAML(2019), PAML(2019), NTMS(2019) | Medium |
| | Classification | Intent Embeddings(2016), LEOPARD(2019) | Medium |
| CV | Unsupervised Learning | CACTUs(2018) | Medium |
| General | | Siamese Network(1994), Triplet Network(2015) | Medium |
| | | MAML+(2018) | Medium |
Image
| Section | Sub-Section | Description | Research Lab | Story | Paper & Code |
|---|---|---|---|---|---|
| Object Detection | R-CNN | | | Medium | Paper(2013) |
| Object Detection | Fast R-CNN | | | Medium | Paper(2015) |
| Object Detection | Faster R-CNN | | | Medium | Paper(2015) |
| Object Detection | VGGNet | | | Medium | Paper(2014) |
| Instance Segmentation | Mask R-CNN | | FAIR | Medium | Paper(2017) |
| Image Classification | ResNet(2015) | | Microsoft | Medium | |
| Image Classification | ResNeXt(2016) | | | Medium | |
Evaluation
| Section | Sub-Section | Description | Story |
|---|---|---|---|
| Introduction | | | Medium |
| Classification | | Confusion Matrix, ROC, AUC | Medium |
| Regression | | MAE, MSE, RMSE, MAPE, WMAPE | Medium |
| Textual | | Perplexity, BLEU, GER, WER, GLUE | Medium |
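
The regression metrics named in the table can be written out in a few lines. The sketch below is illustrative only: the function names are hypothetical, inputs are assumed to be equal-length lists of numbers, and WMAPE is taken to mean total absolute error divided by total absolute actual value, which is one common definition.

```python
import math


def mae(y_true, y_pred):
    """Mean Absolute Error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean Squared Error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root Mean Squared Error, in the same units as the target."""
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean Absolute Percentage Error; undefined if any actual value is 0."""
    return 100 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


def wmape(y_true, y_pred):
    """Weighted MAPE (assumed definition): total absolute error over total absolute actuals."""
    total_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    return 100 * total_error / sum(abs(t) for t in y_true)
```

With actual values `[100, 200, 300]` and predictions `[110, 190, 300]`, MAPE is 5.0 while WMAPE is about 3.33, because WMAPE gives more weight to the larger actual values.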
Source Code
| Section | Sub-Section | Description | Link |
|---|---|---|---|
| Spellcheck | | | Github |
| InferSent | | | Github |