• This repository was archived on 27 May 2018
• Stars: 107
• Rank: 313,946 (top 7%)
• Language: Lua
• Created: almost 9 years ago
• Updated: about 6 years ago

Repository Details

Multi-Perspective Convolutional Neural Networks for modeling textual similarity (He et al., EMNLP 2015)

Multi-Perspective Convolutional Neural Networks for Modeling Textual Similarity

NOTE: This repo contains code for the original Torch implementation from the EMNLP 2015 paper. The code is no longer maintained and has been superseded by a PyTorch reimplementation in Castor. This repo exists solely for archival purposes.

This repo contains the Torch implementation of multi-perspective convolutional neural networks for modeling textual similarity, described in the following paper:

Hua He, Kevin Gimpel, and Jimmy Lin. Multi-Perspective Sentence Similarity Modeling with Convolutional Neural Networks. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP 2015).

This model does not require external resources such as WordNet or parsers, does not use sparse features, and achieves good accuracy on standard public datasets.

Installation and Dependencies

  • Please install the Torch deep learning library. We recommend the self-contained local installation, which includes all of the packages our tool needs; simply follow the instructions at https://github.com/torch/distro

  • Currently our tool runs only on CPUs, so we recommend linking Torch against the Intel MKL library (or at least OpenBLAS) so it runs much faster on CPUs.

  • Our tool also requires the Stanford GloVe embeddings. Please run fetch_and_preprocess.sh to download and preprocess this dataset (around 3 GB); example setup commands follow this list.
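
A minimal sketch of the full setup, assuming a Unix-like shell, the self-contained torch/distro installer, and an illustrative checkout path for this repo (the path and the `sh` invocation are assumptions, not repo documentation):

$ git clone https://github.com/torch/distro.git ~/torch --recursive
$ cd ~/torch && bash install-deps && ./install.sh
$ cd /path/to/MP-CNN-Torch && sh fetch_and_preprocess.sh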

Running

  • One command runs training, tuning, and testing. For the SICK dataset:

$ th trainSIC.lua

    For the MSRVID dataset:

$ th trainMSRVID.lua

The tool outputs Pearson correlation scores and also writes the predicted similarity score for each sentence pair in the test data into the predictions directory.

Adaptation to a New Dataset

To run our model on your own dataset, first build the dataset in the following format and put it under the data folder (a hypothetical example follows the list):

  • a.toks: the first sentence of each pair, one sentence per line.
  • b.toks: the second sentence of each pair, one sentence per line.
  • id.txt: sentence pair IDs, one per line.
  • sim.txt: semantic relatedness gold labels, which can be on any scale; for binary classification, the label set is {0, 1}.
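
For instance, a two-pair dataset could look like this (the sentences, IDs, and scores below are made up purely to illustrate the format):

a.toks:   a man is playing a guitar
          two dogs are running in a field
b.toks:   a man is playing an instrument
          a cat is sleeping on a couch
id.txt:   1
          2
sim.txt:  4.2
          1.0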

Then build the vocabulary for your dataset, which writes vocab-cased.txt into your data folder:

$ python build_vocab.py

Finally, change the training and model code slightly to process your dataset:

  • change util/read_data.lua to handle your data.
  • create a new training script, modeled on trainSIC.lua, that reads in your dataset.
  • change lines 89-102 and 142-148 of Conv.lua to handle your own task.
  • see issue #6 for more details.

Then you should be able to run your training code.

Trained Model

We also provide a model already trained on the STS dataset, which is convenient if you just want to use the model without retraining the whole thing.

The trained model download link is HERE; the model file is about 500 MB. To use the trained model, simply run code like the following:

-- Load the trained model (saved in ASCII format) and put both parts in evaluation mode.
modelTrained = torch.load("download_local_location/modelSTS.trained.th", 'ascii')
modelTrained.convModel:evaluate()
modelTrained.softMaxC:evaluate()

-- One input tensor per sentence: (sentence length) x (embedding dimension).
local linputs = torch.zeros(left_sentence_length, emb_dimension)
linputs = XassignEmbeddingValuesX
local rinputs = torch.zeros(right_sentence_length, emb_dimension)
rinputs = XassignEmbeddingValuesX

-- Forward both sentences through the convolutional model, then the classifier.
local part2 = modelTrained.convModel:forward({linputs, rinputs})
local output = modelTrained.softMaxC:forward(part2)
-- output holds log-probabilities over the similarity classes 0..5; exponentiate,
-- take the expectation, and rescale to [0, 1].
local val = torch.range(0, 5, 1):dot(output:exp())
return val / 5

The output variable 'val' contains a similarity score in [0, 1]: the classifier emits log-probabilities over the six similarity classes 0-5, so exponentiating and taking the expectation gives a score in [0, 5], which is then divided by 5. The inputs linputs/rinputs are Torch tensors, and you need to fill in the word embedding values for both.
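
A minimal sketch of how those embedding values might be filled in, assuming the GloVe vectors have been loaded into a Lua table emb that maps each token string to a torch.Tensor of size emb_dimension (the emb table and the embedSentence helper are illustrations of ours, not part of this repo):

-- Hypothetical helper: build the (sentence length) x (emb_dimension) input
-- tensor for one tokenized sentence; unknown words are left as zero vectors.
local function embedSentence(tokens, emb, emb_dimension)
  local inputs = torch.zeros(#tokens, emb_dimension)
  for i, tok in ipairs(tokens) do
    if emb[tok] then inputs[i]:copy(emb[tok]) end
  end
  return inputs
end

local linputs = embedSentence({"a", "man", "is", "playing", "a", "guitar"}, emb, emb_dimension)
local rinputs = embedSentence({"a", "man", "is", "playing", "an", "instrument"}, emb, emb_dimension)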

Example Deployment Script with Our Trained Model

We provide an example deployment file, testDeployTrainedModel.lua, so it is easy for you to use our model directly. Run:

$ th testDeployTrainedModel.lua

This deployment script uses the trained model (assuming you have downloaded it from the link above) and generates scores for all test sentences of the SICK dataset. Please note that the trained model was not trained on SICK data.

Acknowledgements

We thank Kai Sheng Tai for providing the preprocessing code. We also thank the public data providers and the Torch developers.

More Repositories

1. pyserini: Pyserini is a Python toolkit for reproducible information retrieval research with sparse and dense representations. (Python, 1,488 stars)
2. anserini: Anserini is a Lucene toolkit for reproducible information retrieval research. (Java, 982 stars)
3. daam: Diffusion attentive attribution maps for interpreting Stable Diffusion. (Jupyter Notebook, 608 stars)
4. hedwig: PyTorch deep learning models for document classification. (Python, 586 stars)
5. honk: PyTorch implementations of neural network models for keyword spotting. (Python, 504 stars)
6. docTTTTTquery: docTTTTTquery document expansion model. (Python, 346 stars)
7. pygaggle: A gaggle of deep neural architectures for text ranking and question answering, designed for Pyserini. (Jupyter Notebook, 323 stars)
8. BuboQA: Simple question answering over knowledge graphs (Mohammed et al., NAACL 2018). (Python, 280 stars)
9. rank_llm: Repository for prompt-decoding using LLMs (GPT3.5, GPT4, Vicuna, and Zephyr). (Python, 247 stars)
10. howl: Wake word detection modeling toolkit for Firefox Voice, supporting open datasets like Speech Commands and Common Voice. (Python, 191 stars)
11. castor: PyTorch deep learning models for text processing. (Python, 180 stars)
12. DeeBERT: Dynamic Early Exiting for Accelerating BERT Inference. (Python, 150 stars)
13. birch: Document ranking via sentence modeling using BERT. (Python, 142 stars)
14. covidex: A multi-stage neural search engine for the COVID-19 Open Research Dataset. (TypeScript, 136 stars)
15. duobert: Multi-stage passage ranking: monoBERT + duoBERT. (Python, 109 stars)
16. anserini-notebooks: Anserini notebooks. (Jupyter Notebook, 69 stars)
17. mr.tydi: Mr. TyDi is a multi-lingual benchmark dataset built on TyDi, covering eleven typologically diverse languages. (Python, 68 stars)
18. honkling: Web app for keyword spotting using TensorflowJS. (JavaScript, 67 stars)
19. afriberta: AfriBERTa: Exploring the Viability of Pretrained Multilingual Language Models for Low-resourced Languages. (Python, 60 stars)
20. data: Castorini data. (Python, 59 stars)
21. dhr: Dense hybrid representations for text retrieval. (Python, 55 stars)
22. NCE-CNN-Torch: Noise-Contrastive Estimation for Question Answering with Convolutional Neural Networks (Rao et al., CIKM 2016). (Lua, 54 stars)
23. chatty-goose: A Python framework for conversational search. (Python, 38 stars)
24. transformers-arithmetic (Python, 38 stars)
25. d-bert: Distilling BERT using natural language generation. (Python, 35 stars)
26. hf-spacerini: Plug-and-play Search Interfaces with Pyserini and Hugging Face. (Python, 30 stars)
27. SimpleDBpediaQA: Simple QA over knowledge graphs on DBpedia. (Python, 25 stars)
28. bertserini: BERTserini. (Python, 24 stars)
29. berxit (Python, 21 stars)
30. anserini-tools: Evaluation tools shared across anserini, pyserini, and pygaggle. (Python, 21 stars)
31. onboarding: Onboarding guide to Jimmy Lin's research group at the University of Waterloo. (21 stars)
32. VDPWI-NN-Torch: Very Deep Pairwise Word Interaction Neural Networks for modeling textual similarity (He and Lin, NAACL/HLT 2016). (Lua, 19 stars)
33. TREC-COVID: TREC-COVID results; a mirror of data on the TREC website in a more convenient format. (Roff, 14 stars)
34. perm-sc: Official codebase for permutation self-consistency. (Python, 14 stars)
35. LiT5 (Python, 13 stars)
36. honk-models: Pre-trained models for Honk. (11 stars)
37. howl-deploy: JavaScript deployment for Howl, the wake word detection modeling toolkit for Firefox Voice. (JavaScript, 10 stars)
38. TrecQA-NegEx: Code and dataset for the SIGIR 2017 short paper "Automatically Extracting High-Quality Negative Examples for Answer Selection in Question Answering". (Python, 10 stars)
39. Tweets2013-IA: The Tweets2013 Internet Archive collection. (Scala, 10 stars)
40. AfriTeVa-keji (Python, 10 stars)
41. meanmax: MeanMax estimators. (Python, 9 stars)
42. cqe (Python, 9 stars)
43. SM-CNN-Torch: Torch implementation of Severyn and Moschitti's SIGIR 2015 CNN model for question answering. (Lua, 9 stars)
44. ONNX-demo (Python, 8 stars)
45. anserini-notebooks-afirm2020: Colab notebooks for AFIRM '20. (Jupyter Notebook, 7 stars)
46. serverless-bert-reranking (Python, 7 stars)
47. parrot: Keyword spotting using audio from speech synthesis services and YouTube. (Python, 7 stars)
48. earlyexiting-monobert (Python, 7 stars)
49. afriteva: Text-2-Text for African languages. (Python, 6 stars)
50. tct_colbert (Python, 5 stars)
51. transformers-selective (Python, 5 stars)
52. serverless-inference: Neural network inference on serverless architecture. (Python, 5 stars)
53. norbert: NorBERT: Anserini + dl4marco-bert. (Python, 4 stars)
54. rank_llm_data (3 stars)
55. touche-error-analysis: Old is Gold? Systematic Error Analysis of Neural Retrieval Models against BM25 for Argument Retrieval. (Python, 3 stars)
56. numbert: Passage ranking library using various pretrained LMs. (Python, 3 stars)
57. anserini-spark: Anserini-Spark integration. (Java, 3 stars)
58. kim-cnn-vis: An in-browser visualization of Kim CNN. (JavaScript, 3 stars)
59. replicate-lce (Python, 3 stars)
60. kws-gen-data: Data for KWS generator. (2 stars)
61. pyserini-data (Python, 2 stars)
62. candle: PyTorch utilities for parameter pruning and multiplies reduction. (Python, 2 stars)
63. BuboQA-models (2 stars)
64. gooselight2: Search frontend for Anserini. (Ruby, 2 stars)
65. africlirmatrix: AfriCLIRMatrix is a test collection for cross-lingual information retrieval research in 15 diverse African languages. (2 stars)
66. biasprobe (Python, 2 stars)
67. sigtestv: SIGnificance TESTing Violations: an end-to-end toolkit for evaluating neural networks. (Python, 1 star)
68. howl-models (1 star)
69. SolrAnserini: Anserini integration with Solr. (Python, 1 star)
70. gooselight: 🦆 Anserini + Blacklight 🦆 (Ruby, 1 star)
71. BuboQA-data: Hosting dataset for BuboQA. (1 star)
72. anlessini (Java, 1 star)
73. honkling-models (JavaScript, 1 star)