• Stars: 258
• Rank 158,189 (Top 4 %)
• Language: Jupyter Notebook
• Created over 7 years ago
• Updated almost 6 years ago

Repository Details

Code for Document Similarity on Reuters dataset using the Embed, Encode, Attend, Predict recipe

eeap-examples

Introduction

This repository contains some examples of applying the Embed, Encode, Attend, Predict (EEAP) recipe for building deep learning pipelines, proposed by Matthew Honnibal, creator of the spaCy NLP library.

I also gave a talk about this at PyData Seattle 2017.

Code is in Python. All models are built using the awesome Keras library. Supporting code uses NLTK and Scikit-Learn.

The examples use 4 custom Attention layers, also available here as a Python include file. The examples themselves are written as Jupyter notebooks.

A good complete implementation of attention can be found here.

Data

Please refer to data/README.md for instructions on how to download the data necessary to run these examples.

Examples

Document Classification Task

The document classification task builds a classification model for documents by treating each document as a sequence of sentences, and each sentence as a sequence of words. We start with the bag of words approach, computing a document embedding as the average of its sentence embeddings, and each sentence embedding as the average of its word embeddings. Next we build a hierarchical model: a sentence model that builds sentence embeddings using a bidirectional LSTM, embedded within a document model that builds document embeddings by encoding the sentence model's output with another bidirectional LSTM. Finally we add attention layers at each level (sentence and document model). Our final model is depicted in the figure below:

The models were run against the Reuters 20 Newsgroups data in order to classify a given document into one of 20 classes. The chart below shows the results of running these different experiments. The interesting value here is the test set accuracy, but we have shown training and validation set accuracies as well for completeness.

As you can see, the accuracy rises from about 71% for the bag of words model to about 82% for the hierarchical model that incorporates the Matrix Vector Attention models.
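The difference between the bag of words baseline and an attention layer can be sketched in a few lines. The sketch below is a minimal illustration in plain NumPy rather than Keras, so it runs without a deep learning backend. It is not the code from the notebooks: the function names, the `context` vector (standing in for a learned attention parameter), and the toy embedding matrix are all assumptions made for illustration.

```python
import numpy as np


def mean_pool(vectors):
    """Bag of words pooling: average the rows of a (n, d) matrix."""
    return np.mean(vectors, axis=0)


def attention_pool(vectors, context):
    """Attention pooling: weight each row by its softmaxed score against
    a context vector (hypothetical stand-in for a learned parameter)."""
    scores = vectors @ context                # (n,)
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ vectors, weights


def embed_document_bow(doc, embeddings):
    """doc is a list of sentences, each sentence a list of word ids.
    Sentence vector = mean of word vectors; document vector = mean of
    sentence vectors."""
    sent_vecs = np.stack([mean_pool(embeddings[sent]) for sent in doc])
    return mean_pool(sent_vecs)


# Toy data: 10-word vocabulary with 4-dimensional embeddings (made up).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, 4))
doc = [[1, 2, 3], [4, 5]]

doc_vec = embed_document_bow(doc, embeddings)

# With an all-zero context vector every score is equal, so attention
# pooling reduces to plain averaging.
pooled, weights = attention_pool(embeddings[doc[0]], np.zeros(4))
```

In the hierarchical models the rows being pooled would be bidirectional LSTM outputs rather than raw word embeddings, and the context vector would be trained along with the rest of the network.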


Document Similarity Task

The Document Similarity task uses a nested model similar to the document classification task, where the sentence model generates a sentence embedding from a sequence of word embeddings, and a document model embeds the sentence model to generate a document embedding. A pair of such networks is set up to produce document vectors for the two documents being compared, and the concatenated vector is fed into a fully connected network to predict a binary (similar / not similar) outcome.
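The final step of this architecture, concatenating the two document vectors and passing them through a fully connected network, can be sketched as a NumPy forward pass. This is an illustration under assumptions, not the notebook's model: the single hidden layer, the ReLU activation, the layer sizes, and the `params` dictionary are all hypothetical, and the weights are random rather than trained.

```python
import numpy as np


def dense(x, w, b):
    """A fully connected layer without activation."""
    return x @ w + b


def predict_similar(doc_vec_a, doc_vec_b, params):
    """Concatenate two document vectors and return P(similar)."""
    pair = np.concatenate([doc_vec_a, doc_vec_b])              # (2d,)
    hidden = np.maximum(0.0, dense(pair, params["w1"], params["b1"]))
    logit = dense(hidden, params["w2"], params["b2"])          # (1,)
    return float(1.0 / (1.0 + np.exp(-logit[0])))              # sigmoid


# Hypothetical sizes: 8-dimensional document vectors, 16 hidden units.
rng = np.random.default_rng(0)
d, hidden_units = 8, 16
params = {
    "w1": rng.normal(scale=0.1, size=(2 * d, hidden_units)),
    "b1": np.zeros(hidden_units),
    "w2": rng.normal(scale=0.1, size=(hidden_units, 1)),
    "b2": np.zeros(1),
}

prob = predict_similar(rng.normal(size=d), rng.normal(size=d), params)
```

In the full model the two document vectors would come from the nested sentence and document encoders, and all weights would be trained jointly with a binary cross-entropy loss.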

The dataset for this was manufactured from the Reuters 20 newsgroup dataset. TF-IDF vectors were generated for all 10,000 test set documents, and the similarity between every pair of these vectors was calculated. The pairs in the top 5 percentile of similarity were then selected as the positive set and those in the bottom 5 percentile as the negative set. Even so, there does not appear to be much differentiation: similarity values differed by only about 0.2 between the two sets. A 1% sample was then drawn from each set to make the training set for this network.
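The labeling procedure described above can be sketched as follows. This is a minimal NumPy illustration, not the code used to build the actual dataset: it assumes the TF-IDF vectors are already computed, uses cosine similarity (the source does not name the similarity measure), and the function names and toy matrix are hypothetical.

```python
import numpy as np


def cosine_similarity_pairs(vectors):
    """Return (i, j, sim) arrays for every unordered pair i < j."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T
    i, j = np.triu_indices(len(vectors), k=1)
    return i, j, sims[i, j]


def build_pair_labels(vectors, pct=5.0):
    """Positive pairs lie at or above the (100 - pct)th percentile of
    similarity, negative pairs at or below the pct-th percentile."""
    i, j, sims = cosine_similarity_pairs(vectors)
    hi = np.percentile(sims, 100.0 - pct)
    lo = np.percentile(sims, pct)
    positives = [(int(a), int(b)) for a, b, s in zip(i, j, sims) if s >= hi]
    negatives = [(int(a), int(b)) for a, b, s in zip(i, j, sims) if s <= lo]
    return positives, negatives


def sample_pairs(pairs, fraction, rng):
    """Draw a random sample (at least one pair) without replacement."""
    size = max(1, int(round(fraction * len(pairs))))
    idx = rng.choice(len(pairs), size=size, replace=False)
    return [pairs[k] for k in idx]


# Toy stand-in for TF-IDF vectors: 30 documents, 8 non-negative features.
rng = np.random.default_rng(0)
vectors = np.abs(rng.normal(size=(30, 8)))

positives, negatives = build_pair_labels(vectors, pct=5.0)
train_positives = sample_pairs(positives, 0.5, rng)
```

Note that the dense similarity matrix grows quadratically with the number of documents, so for 10,000 documents a blockwise or sparse computation would likely be needed in practice.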

We built two models, one without attention at either the sentence or document layer, and one with attention on the document layer. Results are shown below:


Sentence Similarity Task

The Sentence Similarity task uses the Semantic Similarity Task dataset from 2012. The objective is to score a pair of sentences on a continuous scale of similarity from 0 to 5. We build a regression network as shown below. Our loss function is Mean Squared Error and our optimizer is RMSProp. Evaluation is done by computing the RMSE between the label similarities and the network predictions on the test set. In addition, we compute the Pearson and Spearman (rank) correlations between the labels and predictions on the test set.
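The three evaluation metrics can be computed with a few lines of NumPy. The sketch below is illustrative only: the function names and the example labels and predictions are made up, and the notebooks may well use library routines instead. Spearman correlation is computed here as the Pearson correlation of the ranks, with tied values given their average rank.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def pearson(x, y):
    """Pearson correlation coefficient."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        mask = values == v
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up similarity labels (0 to 5) and predictions for illustration.
labels = [0.0, 1.5, 3.0, 4.5, 5.0]
preds = [0.5, 1.0, 3.5, 4.0, 5.0]

test_rmse = rmse(labels, preds)
test_pearson = pearson(labels, preds)
test_spearman = spearman(labels, preds)
```

Because Spearman depends only on the ordering, predictions that rank the pairs correctly score 1.0 even when their absolute values are off, which is why it is a useful complement to RMSE.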

Our baseline is a hierarchical network that computes an encoding for each sentence in the pair and uses these encodings, without attention, to generate the prediction. We compare the baseline to the Matrix Matrix dot attention proposed by Parikh et al., where the inputs are scaled to [-1, 1] (MM-dot(s)). Next we compare with an unscaled version of this (MM-dot). Finally, we introduce two new attention implementations based on a description on this Tensorflow NMT page: an additive attention (MM-add) proposed by Bahdanau et al., and a multiplicative attention (MM-mult) proposed by Luong et al. Both operate on the encoder outputs without scaling via tanh. Results are shown below. As can be seen, MM-add and MM-mult result in lower RMSE and generally higher Pearson and Spearman correlations than the baseline.
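For reference, the textbook forms of the two scoring functions can be sketched in NumPy as below. This is a generic formulation of additive (Bahdanau-style) and multiplicative (Luong-style) attention between two sequences of encoder outputs, not the custom layers from this repository; the weight names `w`, `w1`, `w2`, `v`, the sizes, and the random values are assumptions for illustration. The tanh inside the additive score is part of that scoring function itself and is separate from any scaling of the inputs.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multiplicative_scores(a, b, w):
    """Luong-style score: a_i^T W b_j for every pair (i, j).
    a: (m, d), b: (n, d), w: (d, d) -> (m, n)."""
    return a @ w @ b.T


def additive_scores(a, b, w1, w2, v):
    """Bahdanau-style score: v^T tanh(W1 a_i + W2 b_j).
    w1, w2: (d, k), v: (k,) -> (m, n)."""
    proj_a = a @ w1                                    # (m, k)
    proj_b = b @ w2                                    # (n, k)
    hidden = np.tanh(proj_a[:, None, :] + proj_b[None, :, :])  # (m, n, k)
    return hidden @ v                                  # (m, n)


def attend(scores, b):
    """Normalize scores over the second sequence and build one context
    vector per position of the first sequence."""
    weights = softmax(scores, axis=1)                  # (m, n)
    return weights @ b, weights                        # (m, d), (m, n)


# Hypothetical encoder outputs: 5 and 7 timesteps, 4 dimensions each.
rng = np.random.default_rng(0)
d, k = 4, 3
a = rng.normal(size=(5, d))
b = rng.normal(size=(7, d))

mult_context, mult_weights = attend(
    multiplicative_scores(a, b, rng.normal(size=(d, d))), b)
add_context, add_weights = attend(
    additive_scores(a, b, rng.normal(size=(d, k)),
                    rng.normal(size=(d, k)), rng.normal(size=k)), b)
```

In a trained network the weight matrices would be learned, and the resulting context vectors would be combined with the encoder outputs before the final regression layer.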

More Repositories

1. statlearning-notebooks (376 stars): Python notebooks for exercises covered in Stanford statlearning class (where exercises were in R).
2. dl-models-for-qa (Python, 236 stars): Keras DL models to answer 8th grade science multiple choice questions (Kaggle AllenAI competition).
3. nltk-examples (Python, 183 stars): Worked examples from the NLTK Book
4. holiday-similarity (Jupyter Notebook, 103 stars): Finding similar images in the Holidays dataset
5. fttl-with-keras (Jupyter Notebook, 83 stars): Transfer Learning and Fine Tuning for Cross Domain Image Classification with Keras
6. ner-re-with-transformers-odsc2022 (Jupyter Notebook, 47 stars): Building NER and RE components using HuggingFace Transformers
7. hia-examples (Java, 39 stars): Hadoop In Action Examples
8. mlia-examples (Python, 39 stars): Python and R Examples
9. pytorch-gnn-tutorial-odsc2021 (Jupyter Notebook, 36 stars): Repository for GNN tutorial using Pytorch and Pytorch Geometric (PyG) for ODSC 2021
10. reuters-docsim (Python, 28 stars): Different approaches to computing document similarity
11. mia-scala-examples (Scala, 26 stars): Mahout Examples
12. keras-tutorial-odsc2020 (Jupyter Notebook, 26 stars): Notebooks for Keras Tutorial presented at ODSC West 2020
13. ltr-examples (Jupyter Notebook, 16 stars): Supporting code for Learning to Rank (LTR) presentation
14. polydlot (Jupyter Notebook, 16 stars): My attempt to learn more than one Deep Learning framework
15. intro-dl-talk-code (Jupyter Notebook, 14 stars): Jupyter notebooks and code for Intro to DL talk at Genesys
16. scalcium (Scala, 10 stars): Scala NLP Algorithms
17. solr4-extras (Scala, 10 stars): Random solr4 customizations
18. delsym (Scala, 10 stars): An actor based content ingestion pipeline
19. nlp-graph-examples (Jupyter Notebook, 10 stars): Examples for Graphorum 2019 presentation -- Graph Techniques for Natural Language Processing
20. esc (Scala, 9 stars): Scala client for ElasticSearch
21. deeplearning-ai-examples (Jupyter Notebook, 8 stars)
22. bpwj (Java, 8 stars): Java Parser Development Framework from Steven Metsker's "Building Parsers With Java" book
23. thinkstats-examples (Jupyter Notebook, 8 stars): Worked examples for exercises in Think Stats using the Scientific Python stack.
24. saturn-scispacy (Jupyter Notebook, 8 stars): SaturnCloud notebooks to extract annotations from CORD-19 dataset using SciSpacy pretrained models
25. llm-rag-eval (Python, 8 stars): Large Language Model (LLM) powered evaluator for Retrieval Augmented Generation (RAG) pipelines.
26. content-engineering-tutorial (Jupyter Notebook, 7 stars)
27. kg-aligned-entity-linker (Jupyter Notebook, 7 stars): Knowledge Graph Aligned Entity Linker using BERT and Sentence Transformers
28. vespa-poc (Python, 7 stars): Small Proof of Concept to familiarize myself with Vespa.ai functionality
29. neural-re-experiments (Jupyter Notebook, 5 stars)
30. bayesian-stats-examples (Jupyter Notebook, 4 stars): Python versions of things taught in the Bayesian Statistics courses on Coursera
31. neurips-papers-node2vec (Jupyter Notebook, 3 stars)
32. tgni (Java, 3 stars): Experimental NER techniques to address common (for me) text analysis problems.
33. snorkel-pytorch-lstm-gpu (Python, 2 stars): Code for my GPU port of Snorkel's Pytorch discriminative model (LSTM)
34. compmethods-notebooks (2 stars): Python Notebooks for the Computational Methods for Data Analysis course on Coursera.
35. claimintel (Scala, 1 star): Descriptive Stats on Claims Data
36. misc-docs (1 star): Account for storing miscellaneous text files for sharing
37. spark-data-algorithms (1 star): Implementations of common data algorithms in Spark
38. sherpa (Python, 1 star): Django based web application to help with organizing a conference (summit)
39. pytorch-drl-examples (1 star): Reimplementation of Deep Reinforcement Learning examples from "Deep Reinforcement Learning with Python" by Sudharsan Ravichandran