Deep Learning for Text Pairs Relation Classification
This repository is my bachelor's graduation project, and it is also a study of TensorFlow and deep learning (CNN, RNN, etc.).
The main objective of the project is to determine whether two given sentences are similar in meaning (a binary classification problem) using neural networks (FastText, CNN, LSTM, etc.).
Requirements
- Python 3.6
- Tensorflow 1.15.0
- Tensorboard 1.15.0
- Sklearn 0.19.1
- Numpy 1.16.2
- Gensim 3.8.3
- Tqdm 4.49.0
Project
The project structure is below:
.
├── Model
│   ├── test_model.py
│   ├── text_model.py
│   └── train_model.py
├── data
│   ├── word2vec_100.model.* [Need Download]
│   ├── Test_sample.json
│   ├── Train_sample.json
│   └── Validation_sample.json
├── utils
│   ├── checkmate.py
│   ├── data_helpers.py
│   └── param_parser.py
├── LICENSE
├── README.md
└── requirements.txt
Innovation
Data part
- Make the data support Chinese and English (can use jieba or nltk).
- Can use your pre-trained word vectors (can use gensim).
- Add embedding visualization based on TensorBoard (need to create metadata.tsv first).
Model part
- Add the correct L2 loss calculation operation.
- Add gradient clipping to prevent gradient explosion.
- Add exponential learning rate decay.
- Add a new Highway Layer (which is useful according to the model performance).
- Add Batch Normalization Layer.
- Add several performance measures (especially the AUC) since the data is imbalanced.
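The gradient clipping and learning rate decay items above can be illustrated without TensorFlow. The following is a minimal pure-Python sketch, assuming clipping by global norm and the common formula lr = base_lr * decay_rate ^ (step / decay_steps). The function names and hyperparameter values are illustrative assumptions and are not taken from this repository's code.

```python
import math


def exponential_decay(base_lr, global_step, decay_steps, decay_rate, staircase=False):
    """Return base_lr * decay_rate ** (global_step / decay_steps)."""
    exponent = global_step / decay_steps
    if staircase:
        # Decay in discrete intervals instead of continuously.
        exponent = math.floor(exponent)
    return base_lr * decay_rate ** exponent


def clip_by_global_norm(gradients, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if global_norm <= clip_norm:
        return [list(grad) for grad in gradients], global_norm
    scale = clip_norm / global_norm
    return [[g * scale for g in grad] for grad in gradients], global_norm


# Hypothetical values, chosen only for illustration.
lr = exponential_decay(base_lr=0.001, global_step=10000, decay_steps=5000, decay_rate=0.5)
clipped, norm_before = clip_by_global_norm([[3.0, 4.0], [0.0, 12.0]], clip_norm=5.0)
```

Clipping by the global norm keeps the direction of the overall update unchanged, because every gradient is scaled by the same factor.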
Code part
- Can choose to train the model directly or restore the model from a checkpoint in train.py.
- Can create the prediction file, which includes the predicted values and predicted labels of the test set data, in test.py.
- Add other useful data preprocessing functions in data_helpers.py.
- Use logging to record the whole info (including parameter display, model training info, etc.).
- Provide the ability to save the best n checkpoints in checkmate.py, whereas tf.train.Saver can only save the last n checkpoints.
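To illustrate the idea of keeping the best n checkpoints rather than the last n, here is a minimal sketch based on a min-heap. It is not the API of checkmate.py. The class name BestCheckpointTracker, its methods, and the checkpoint names are hypothetical, and the sketch only tracks paths; it does not write or delete any files.

```python
import heapq


class BestCheckpointTracker:
    """Keep the n checkpoints with the highest metric value (hypothetical helper)."""

    def __init__(self, num_to_keep):
        self.num_to_keep = num_to_keep
        # Min-heap of (metric, path); the worst kept checkpoint sits at index 0.
        self._heap = []

    def update(self, metric, path):
        """Register a checkpoint and return the path to delete, or None."""
        if len(self._heap) < self.num_to_keep:
            heapq.heappush(self._heap, (metric, path))
            return None
        if metric <= self._heap[0][0]:
            # Not better than the worst kept checkpoint, so discard the new one.
            return path
        # Replace the worst kept checkpoint with the new one.
        _, evicted_path = heapq.heapreplace(self._heap, (metric, path))
        return evicted_path

    def best(self):
        """Return the kept checkpoints sorted from best to worst."""
        return sorted(self._heap, reverse=True)


tracker = BestCheckpointTracker(num_to_keep=2)
history = [(100, 0.71), (200, 0.80), (300, 0.65), (400, 0.83)]
removed = [tracker.update(auc, f"model-{step}") for step, auc in history]
```

In this run the checkpoints from steps 200 and 400 are kept, because they have the two highest validation scores, even though step 300 is more recent than step 200.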
Data
See the data format in the /data folder, which includes the data sample files. For example:
{"front_testid": "4270954", "behind_testid": "7075962", "front_features": ["invention", "inorganic", "fiber", "based", "calcium", "sulfate", "dihydrate", "calcium"], "behind_features": ["vcsel", "structure", "thermal", "management", "structure", "designed"], "label": 0}
- "testid": just the id.
- "features": the word segments (after removing the stopwords).
- "label": 0 or 1. 1 means that two sentences are similar, and 0 means the opposite.
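As a rough illustration of the fields described above, the sketch below parses one record with the standard json module. It assumes that each record is a single JSON object on its own line, as in the example; the record here is shortened and the variable names are illustrative.

```python
import json

# One record in the sample format shown above (shortened for readability).
line = (
    '{"front_testid": "4270954", "behind_testid": "7075962", '
    '"front_features": ["invention", "inorganic", "fiber"], '
    '"behind_features": ["vcsel", "structure", "thermal"], "label": 0}'
)

record = json.loads(line)
front_tokens = record["front_features"]    # tokens of the first sentence
behind_tokens = record["behind_features"]  # tokens of the second sentence
is_similar = record["label"] == 1          # 1 means similar, 0 means not similar
```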
Text Segment
- You can use the nltk package if you are going to deal with English text data.
- You can use the jieba package if you are going to deal with Chinese text data.
Data Format
This repository can be used with other datasets (text pair similarity classification) in two ways:
- Convert your dataset into the same format as the sample.
- Modify the data preprocessing code in data_helpers.py.
Which way fits best depends on your data and task.
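For the first way, a conversion script could look roughly like the sketch below. The tokenizer is a naive whitespace split used as a stand-in for nltk or jieba, and the stopword list, the function names, and the example sentences are all hypothetical.

```python
import json

# Hypothetical stopword list; replace it with one suited to your corpus.
STOPWORDS = {"a", "an", "the", "of", "on", "is", "for"}


def tokenize(text):
    """Naive whitespace tokenizer used as a stand-in for nltk or jieba."""
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


def to_record(front_id, front_text, behind_id, behind_text, label):
    """Build one record with the same keys as the sample files."""
    return {
        "front_testid": str(front_id),
        "behind_testid": str(behind_id),
        "front_features": tokenize(front_text),
        "behind_features": tokenize(behind_text),
        "label": int(label),
    }


record = to_record(1, "The structure of a thermal fiber",
                   2, "Thermal management for the structure", 1)
json_line = json.dumps(record)  # one JSON object per line, as in the sample
```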
Pre-trained Word Vectors
You can download the Word2vec model file (dim=100). Make sure the files are unzipped and placed under the /data folder.
You can also pre-train your own word vectors (based on your corpus) in several ways:
- Use the gensim package to pre-train the word vectors.
- Use the glove tools to pre-train the word vectors.
- You can even use a fasttext network to pre-train the word vectors.
Before you open a new issue, please check the data sample file under the data folder and read the other open issues first, because someone may have already asked the same question.
Usage
See Usage.
Network Structure
FastText
References:
TextANN
References:
- Personal ideas
TextCNN
References:
- Convolutional Neural Networks for Sentence Classification
- A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification
TextRNN
Warning: The model can be used but is not finished yet 🤪!
TODO
- Add BN-LSTM cell unit.
- Add attention.
References:
TextCRNN
References:
- Personal ideas
TextRCNN
References:
- Personal ideas
TextHAN
References:
TextSANN
Warning: The model can be used but is not finished yet 🤪!
TODO
- Add attention penalization loss.
- Add visualization.
References:
TextABCNN
Warning: Only the ABCNN-1 model is implemented 🤪!
TODO
- Add ABCNN-3 model.
References:
About Me
黄威, Randolph
SCU SE Bachelor; USTC CS Ph.D.
Email: [email protected]
My Blog: randolph.pro
LinkedIn: randolph's linkedin