• This repository was archived on 14/Aug/2019
• Stars: 326
• Rank: 129,027 (Top 3%)
• Language: Python
• License: MIT License
• Created: over 7 years ago
• Updated: over 6 years ago


Repository Details

Various models and code (Manhattan LSTM, Siamese LSTM + Matching Layer, BiMPM) for the paraphrase identification task, specifically with the Quora Question Pairs dataset.


paraphrase-id-tensorflow

Various models and code for paraphrase identification implemented in TensorFlow (1.1.0).

I took great care to document the code and explain what I'm doing at various steps throughout the models; hopefully it'll be didactic example code for those looking to get started with TensorFlow!

So far, this repo has implemented:

• A Manhattan LSTM-style Siamese BiLSTM (the baseline Siamese model)
• A Siamese BiLSTM with an added matching layer
• BiMPM (Bilateral Multi-Perspective Matching)

PRs to add more models, or to optimize or patch existing ones, are more than welcome! The bulk of the model code resides in duplicate_questions/models.

A lot of the data processing code is taken from or inspired by allenai/deep_qa; go check them out if you like how this project is structured!

Installation

This project was developed in and has been tested on Python 3.5 (it likely doesn't work with other versions of Python), and the package requirements are in requirements.txt.

To install the requirements:

pip install -r requirements.txt
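
If you don't already have a Python 3.5 environment, one way to run the above in isolation is the standard venv module (a minimal sketch; it assumes a python3.5 interpreter is on your PATH, and env/ is just an example name):

python3.5 -m venv env
source env/bin/activate
pip install -r requirements.txt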

Note that after installing the requirements, you have to download the necessary NLTK data by running (in your shell):

python -m nltk.downloader punkt
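
To sanity-check that the punkt data was downloaded correctly, you can tokenize a sentence from your shell (just a quick check, not part of the project's scripts):

python -c "import nltk; print(nltk.word_tokenize('Is this question a duplicate?'))"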

GPU Training and Inference

Note that the requirements.txt file specifies tensorflow as a dependency, which is the CPU-only version of TensorFlow. If you have a GPU, you should uninstall the CPU TensorFlow and install the GPU version by running:

pip uninstall tensorflow
pip install tensorflow-gpu
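
To confirm that TensorFlow can actually see your GPU afterwards, you can list the devices it detects (a quick check; with a working GPU setup the output should include a /gpu:0 device in addition to /cpu:0):

python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"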

Getting / Processing The Data

To begin, run the following to generate the auxiliary directories for storing data, trained models, and logs:

make aux_dirs

In addition, if you want to use pretrained GloVe vectors, run:

make glove

which will download pretrained GloVe vectors to data/external/. Extract the files in that same directory.
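
For example, assuming the download lands as one or more zip archives in data/external/ (the exact filenames depend on which GloVe release the Makefile fetches), extraction could look like:

unzip "data/external/*.zip" -d data/external/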

Quora Question Pairs

To use the Quora Question Pairs data, download the dataset from Kaggle (may require an account). Place the downloaded zip archives in data/raw/, and extract the files to that same directory.
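
For example, assuming the Kaggle download gives you zip archives such as train.csv.zip and test.csv.zip (the exact filenames may differ), extraction could look like:

unzip "data/raw/*.zip" -d data/raw/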

Then, run:

make quora_data

to automatically clean and process the data with the scripts in scripts/data/quora.

Running models

To train a model, or to load and predict with a model, run the scripts in scripts/run_model/ with python <script_path>. You can get additional documentation about the parameters they take by running python <script_path> -h.

Here's an example run command for the baseline Siamese BiLSTM:

python scripts/run_model/run_siamese.py train --share_encoder_weights --model_name=baseline_siamese --run_id=0

Here's an example run command for the Siamese BiLSTM with matching layer:

python scripts/run_model/run_siamese_matching_bilstm.py train --share_encoder_weights --model_name=siamese_matching --run_id=0

Here's an example run command for the BiMPM model:

python scripts/run_model/run_bimpm.py train --early_stopping_patience=5 --model_name=biMPM --run_id=0

Note that the defaults might not be ideal for your use, so feel free to turn the knobs however you like.
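
Since training writes TF summaries under logs/ (see the project layout below), you can also monitor runs with TensorBoard. A minimal invocation, assuming TensorBoard was installed alongside TensorFlow:

tensorboard --logdir logs/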

Contributing

Do you have ideas on how to improve this repo? Have a feature request, bug report, or patch? Feel free to open an issue or PR, as I'm happy to address issues and look at pull requests.

Project Organization

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- Original immutable data (e.g. Quora Question Pairs).
│
├── logs               <- Logs from training or prediction, including TF model summaries.
│
├── models             <- Serialized models.
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment.
│
├── duplicate_questions<- Module with source code for models and data.
│   ├── data           <- Methods and classes for manipulating data.
│   │
│   ├── models         <- Methods and classes for training models.
│   │
│   └── util           <- Various helper methods and classes for use in models.
│
├── scripts            <- Scripts for generating the data.
│   ├── data           <- Scripts to clean and split data.
│   │
│   └── run_model      <- Scripts to train and predict with models.
│
└── tests              <- Directory with unit tests.

More Repositories

1. contextual-repr-analysis (Python, 213 stars)
   A toolkit for evaluating the linguistic knowledge and transferability of contextual representations. Code for "Linguistic Knowledge and Transferability of Contextual Representations" (NAACL 2019).
2. pytorch-manylinux-binaries (Shell, 103 stars)
3. evaluating-verifiability-in-generative-search-engines (Python, 76 stars)
   Companion repo for "Evaluating Verifiability in Generative Search Engines".
4. lost-in-the-middle (Python, 72 stars)
   Code and data for "Lost in the Middle: How Language Models Use Long Contexts".
5. cython-crash-course (Jupyter Notebook, 32 stars)
   A quick intro to Cython for Python users who don't know C.
6. flatten_gigaword (Python, 24 stars)
   Dump the text of the Gigaword dataset into a single file, for use with language modeling (and other!) toolkits.
7. inoculation-by-finetuning (Jsonnet, 18 stars)
   Code for the paper "Inoculation by Fine-Tuning: A Method for Analyzing Challenge Datasets", to be presented at NAACL 2019.
8. lexical-semantic-recognition (Python, 16 stars)
9. pytorch-paper-classifier (Python, 12 stars)
   A simple model for classifying papers by academic venue (AI/ML/ACL), given a title and abstract. Bare-metal PyTorch port of https://github.com/allenai/allennlp-as-a-library-example .
10. ASLSpeak (Python, 9 stars)
    🎤 DubHacks 2015 project. Decode sign language using the Leap Motion, and speak it!
11. website (HTML, 9 stars)
12. MyoDrone (JavaScript, 7 stars)
    🚁 Controlling a Parrot AR 2 Drone with Thalmic Labs Myo. PennApps 2015 Spring.
13. word2color (Python, 7 stars)
    Given a description of a color, return its closest standard HTML4 color.
14. nlp-phd-vis (Jupyter Notebook, 4 stars)
    Visualizations on NLP PhD applications.
15. BitStation-App (Ruby, 3 stars)
    💸 The MIT Kerberos-integrated social wallet. Winner of BitComp 2014 Improving MIT Award.
16. talks_and_tutorials (HTML, 3 stars)
    Repository for materials for informal and slightly less informal talks and tutorials.
17. LSTMs-exploit-linguistic-attributes (Python, 3 stars)
    Code for the paper "LSTMs Exploit Linguistic Attributes of Data", presented at the ACL 2018 Workshop on Representation Learning for NLP.
18. quoref-annotation (HTML, 2 stars)
19. SMSiri (Python, 2 stars)
    ☎️ Built @ MHacks 6 -- An SMS based natural language question and answer system.
20. phonesthemes (Python, 2 stars)
    Code for the paper "Discovering Phonesthemes with Sparse Regularization", to be presented at the NAACL 2018 Workshop on Subword and Character Level Models in NLP.
21. oov-translation (Python, 1 star)
    Code accompanying "Augmenting Statistical Machine Translation with Subword Translation of Out-of-Vocabulary Words".
22. pytorch-pretrained-bert-feedstock (Shell, 1 star)
    A conda-smithy repository for pytorch-pretrained-bert.
23. ical2org (Python, 1 star)
    Convert iCal-format calendars into a file of Emacs org-mode entries.