  • Stars: 237
  • Rank: 168,973 (Top 4%)
  • Language: Python
  • License: Apache License 2.0
  • Created: almost 8 years ago
  • Updated: almost 7 years ago

Repository Details

dl-models-for-qa

Keras DL models to answer 8th grade science multiple choice questions (Kaggle AllenAI competition).

⚠ WARNING: As a colleague recently pointed out, the 75% accuracies achieved by the QA models described in this project could also have been achieved by a classifier that learns to always return false (since 3 of the 4 answers for each training question are false). He took the evaluation one step further and compared the softmax probabilities associated with each of the 4 possible answers against the correct answer, and got accuracies around 25%, once again indicative of a model that does no better than random (1 in 4 answers is correct).

Introduction

This repository contains Python code to train and deploy Deep Learning (DL) models for Question Answering (QA). The code accompanies a talk I gave at the Question Answering Workshop organized by the Elsevier Search Guild.

You can find the slides for this talk on Slideshare.

The code is in Python. All models are built using the awesome Keras library; supporting code also uses gensim, NLTK and spaCy.

The objective of the code was to build DL model(s) to answer 8th grade multiple-choice science questions, provided as part of the AllenAI competition on Kaggle.

Models

Much of the inspiration for the DL implementations in this project came from the solution posted by the 4th place winner of the competition, who used DL models along with traditional Information Retrieval (IR) models.

Models using bAbI dataset

In order to gain some intuition about how to use DL for QA, I looked at two models from the Keras examples that use the single supporting fact task (task #1) from the bAbI dataset created by Facebook. These two models are described below.

The bAbI dataset can be thought of as a collection of (story, question, answer) triples. In the case of task #1, the answer is always a single word. The example below illustrates the data format for task #1: the question line carries the answer word and the line number of the supporting fact, tab-separated.
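For instance, the first record in the task #1 training file looks like this:

    1 Mary moved to the bathroom.
    2 John went to the hallway.
    3 Where is Mary?	bathroom	1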

Both models attempt to predict the answer as the most probable word from the entire vocabulary.

BABI-LSTM

Implementation based on the paper Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks by Weston et al. Adapted from a similar example in the Keras examples.

The embedding is computed inline using the story and question. Observed accuracy (56%) is similar to that reported in the paper (50%).
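A minimal sketch of this kind of model, assuming the data preparation step produced padded word-id sequences (the sizes below are assumptions, not taken from the original code):

    # Hedged sketch of an LSTM reader for bAbI task #1.
    from keras.layers import Concatenate, Dense, Embedding, Input, LSTM
    from keras.models import Model

    vocab_size, story_maxlen, query_maxlen = 22, 68, 4  # assumed task #1 sizes

    story_input = Input(shape=(story_maxlen,))
    question_input = Input(shape=(query_maxlen,))

    # embeddings are learned inline, jointly with the rest of the network
    story_emb = Embedding(vocab_size, 64)(story_input)
    question_emb = Embedding(vocab_size, 64)(question_input)

    story_vec = LSTM(64)(story_emb)
    question_vec = LSTM(64)(question_emb)

    merged = Concatenate()([story_vec, question_vec])
    # predict the answer as the most probable word in the vocabulary
    output = Dense(vocab_size, activation="softmax")(merged)

    model = Model(inputs=[story_input, question_input], outputs=output)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])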

BABI-MEMNN

Implementation based on the paper End-To-End Memory Networks by Sukhbaatar, Szlam, Weston and Fergus. Adapted from a similar example in the Keras examples.

Accuracy achieved by this implementation (on 1k triples) is around 42%, compared to the 99% reported in the paper.
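A condensed single-hop sketch, loosely following the Keras babi_memnn example (sizes assumed as before):

    # Hedged sketch of a single-hop end-to-end memory network for task #1.
    from keras.layers import (Activation, Add, Concatenate, Dense, Dot,
                              Embedding, Input, LSTM, Permute)
    from keras.models import Model

    vocab_size, story_maxlen, query_maxlen = 22, 68, 4  # assumed sizes

    story_input = Input(shape=(story_maxlen,))
    question_input = Input(shape=(query_maxlen,))

    # memory and output embeddings of the story, plus the question embedding
    input_m = Embedding(vocab_size, 64)(story_input)            # (story_maxlen, 64)
    input_c = Embedding(vocab_size, query_maxlen)(story_input)  # (story_maxlen, query_maxlen)
    question_emb = Embedding(vocab_size, 64)(question_input)    # (query_maxlen, 64)

    # attention of the question over the story memory
    match = Dot(axes=(2, 2))([input_m, question_emb])           # (story_maxlen, query_maxlen)
    match = Activation("softmax")(match)

    response = Add()([match, input_c])
    response = Permute((2, 1))(response)                        # (query_maxlen, story_maxlen)

    answer = Concatenate()([response, question_emb])
    answer = LSTM(32)(answer)
    output = Dense(vocab_size, activation="softmax")(answer)

    model = Model(inputs=[story_input, question_input], outputs=output)
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])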

Models using Kaggle dataset

From this point on, all models use the competition dataset. A training set record is a (question, answer_A, answer_B, answer_C, answer_D, correct_answer) tuple, and the objective is to predict the index of the correct answer.

We can thus think of this as a classification problem, where each training record yields 1 positive example and 3 negative examples, as sketched below.
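A minimal sketch of that unrolling (the field names here are hypothetical):

    # Turn one competition record into 4 labeled (question, answer) pairs:
    # 1 positive and 3 negatives.
    def unroll(record):
        pairs = []
        for idx, answer in enumerate(record["answers"]):  # [A, B, C, D]
            label = 1 if "ABCD"[idx] == record["correct"] else 0
            pairs.append((record["question"], answer, label))
        return pairs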

QA-LSTM

Implementation based on the paper LSTM-based Deep Learning Models for Non-factoid Answer Selection by Tan, dos Santos, Xiang and Zhou.

Unlike the bAbI models, the embedding uses the pre-trained Google News Word2Vec model to convert the story and question input vectors (sparse 1-hot representations) into dense representations of size (300,).
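A sketch of how such an embedding can be wired up with gensim and Keras (the file name is the standard GoogleNews model; word_index stands in for the word-to-id mapping produced during data preparation):

    import numpy as np
    from gensim.models import KeyedVectors
    from keras.layers import Embedding

    # load the pre-trained 300-dimensional Google News vectors
    w2v = KeyedVectors.load_word2vec_format(
        "data/comp_data/GoogleNews-vectors-negative300.bin.gz", binary=True)

    word_index = {"atom": 1, "energy": 2}  # stand-in for the real vocabulary mapping

    # build a dense embedding matrix for our own vocabulary
    embedding_matrix = np.zeros((len(word_index) + 1, 300))
    for word, idx in word_index.items():
        if word in w2v:
            embedding_matrix[idx] = w2v[word]

    # frozen embedding layer: converts word-id input to dense (300,) vectors
    embedding = Embedding(len(word_index) + 1, 300,
                          weights=[embedding_matrix], trainable=False)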

Accuracy numbers from the implementation are 56.93% with unidirectional LSTMs and 57.0% with bidirectional LSTMs.

QA-LSTM CNN

Same as QA-LSTM, but with an additional 1D Convolution/MaxPooling layer to further extract the meaning of the question and answer, as sketched below.
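A sketch of the extra layer on the question branch (the maxlen and vocabulary size are assumptions; the answer branch gets the same treatment):

    from keras.layers import Conv1D, Embedding, Flatten, Input, LSTM, MaxPooling1D

    question_input = Input(shape=(40,))                    # assumed question maxlen
    question_emb = Embedding(10000, 300)(question_input)   # assumed vocab size

    # the LSTM now returns the full sequence so the convolution can
    # extract local n-gram features from it
    question_seq = LSTM(128, return_sequences=True)(question_emb)
    question_conv = Conv1D(filters=64, kernel_size=3,
                           activation="tanh")(question_seq)
    question_vec = Flatten()(MaxPooling1D(pool_size=2)(question_conv))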

Produces slightly worse accuracy than the QA-LSTM model: 55.7% with unidirectional LSTMs. I did not try bidirectional LSTMs.

QA-LSTM with Attention

A general problem with RNNs is the vanishing gradient. While LSTMs mitigate it, they still suffer from it over the very long distances involved in QA contexts. The solution is attention, where the network is forced to look at certain parts of the context and ignore (in a relative sense) everything else.
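A minimal dot-product attention sketch in Keras, assuming embedded question and answer sequences as in the earlier models (maxlens and vocabulary size are again assumptions):

    from keras.layers import Activation, Dot, Embedding, Input, LSTM

    question_input = Input(shape=(40,))   # assumed maxlens and vocab size
    answer_input = Input(shape=(10,))
    question_emb = Embedding(10000, 300)(question_input)
    answer_emb = Embedding(10000, 300)(answer_input)

    question_vec = LSTM(128)(question_emb)                     # (128,)
    answer_seq = LSTM(128, return_sequences=True)(answer_emb)  # (10, 128)

    # similarity of each answer timestep to the question encoding
    scores = Dot(axes=(2, 1))([answer_seq, question_vec])      # (10,)
    weights = Activation("softmax")(scores)

    # attention-weighted sum of answer timesteps replaces plain pooling
    answer_vec = Dot(axes=(1, 1))([weights, answer_seq])       # (128,)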

Incorporating External Knowledge

Based on the competition message boards, there seemed to be general consensus that external content was okay to use, and several sources were mentioned.

Most of the sources mentioned would involve quite a lot of effort to scrape/crawl the sites and parse the crawled content. One source (Flashcards from StudyStack) was available in pre-parsed form, so I used that. This gave me about 400k flashcard records, each a question followed by the correct answer. I thought of these as the "story" in the bAbI sense.

QA-LSTM with Attention and Custom Embedding

My first attempt at incorporating the story was to replace the embedding from the pre-trained Word2Vec model with a Word2Vec model trained on the flashcard data. This created a smaller, more compact embedding and gave quite a good boost in accuracy.
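A sketch of how such an embedding can be trained with gensim (the corpus shown is a tiny stand-in for the tokenized flashcard texts; the vector_size parameter is named size in older gensim versions):

    from gensim.models import Word2Vec

    # stand-in for the iterable of tokenized flashcard texts
    sentences = [["photosynthesis", "occurs", "in", "the", "chloroplast"],
                 ["the", "chloroplast", "contains", "chlorophyll"]]

    # train a 300-dimensional Word2Vec model on the flashcard corpus
    flashcard_w2v = Word2Vec(sentences, vector_size=300, window=5,
                             min_count=1, workers=4)
    # save in word2vec format so it can stand in for the GoogleNews model
    flashcard_w2v.wv.save_word2vec_format(
        "data/comp_data/studystack-w2v.bin", binary=True)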

Model                                Default Embedding   Story Embedding
QA-LSTM w/ Attention                 62.93%              76.27%
QA-LSTM bidirectional w/ Attention   60.43%              76.27%

The qa-lstm-fem-attn model(s) are identical to the qa-lstm-attn model(s) except for the embedding: instead of the default Word2Vec embedding, they use the custom embedding built from the flashcard data.

QA-LSTM with Story

My second attempt at incorporating the story data was to create (story, question, answer) triples similar to the bAbI models. The first step is to load the flashcards into an Elasticsearch (ES) index, one flashcard per record. For each question, the nouns and verbs are extracted and combined into an OR query that is sent to ES; the top 10 flashcards retrieved become the story for that triple.
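A sketch of this retrieval step, using NLTK for POS tagging and the elasticsearch Python client (7.x style); the index and field names here are assumptions:

    import nltk
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    def story_for(question, index="flashcards", topn=10):
        # keep only the nouns and verbs from the question
        # (assumes the NLTK punkt and tagger data are installed)
        tokens = nltk.word_tokenize(question)
        terms = [word for word, tag in nltk.pos_tag(tokens)
                 if tag.startswith(("NN", "VB"))]
        # a match query ORs its terms together by default
        resp = es.search(index=index,
                         body={"query": {"match": {"text": " ".join(terms)}},
                               "size": topn})
        return [hit["_source"]["text"] for hit in resp["hits"]["hits"]]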

Once I have the "story" associated with each (question, answer) pair, I construct a network that encodes the story alongside the question and answer. This model did not perform as well as the QA-LSTM with Attention models: accuracy was 70.47% with unidirectional LSTMs and 61.77% with bidirectional LSTMs.

Results

Results from the various QA-LSTM variants against the Kaggle dataset are summarized below.

Model Specifications                                     Test Acc. (%)
QA-LSTM (Baseline)                                       56.93
QA-LSTM Bidirectional                                    57.0
QA-LSTM + CNN                                            55.7
QA-LSTM with Attention                                   62.93
QA-LSTM Bidirectional with Attention                     60.43
QA-LSTM with Attention + Custom Embedding                76.27
QA-LSTM Bidirectional w/ Attention + Custom Embedding    76.27
QA-LSTM + Attention + Story Facts                        70.47
QA-LSTM Bidirectional + Attention + Story Facts          61.77

Data

Data is not included in this project. However, most of it is available on the Internet, and I have included links where applicable. The code expects the following directory structure:

PROJECT_HOME
   |
   +---- data
   |       |
   |       +---- babi_data
   |       |
   |       +---- comp_data
   |       |
   |       +---- models

The bAbI dataset is available from this URL. Download it and expand the tarball under the babi_data directory.

My code uses the original dataset provided along with the competition, which is no longer available (and cannot be distributed). However, AllenAI provides an alternative dataset which can be used instead. These files need to be copied into the comp_data subdirectory. Note that the format of the new data is slightly different, but fortunately well documented, so you will have to adapt the parsing logic in kaggle.py. Look for the following verbiage to find the correct dataset to download.

AI2 8th Grade Science Questions (No Diagrams)

641 questions, February 2016. These question sets are derived from a variety of regional and state science exams.

These science exam questions guide our research into multiple choice question answering at the elementary science level. This download contains 8th grade-level multiple choice questions that do not incorporate diagrams.

The comp_data directory should also contain the GoogleNews Word2Vec model, which is needed to load the default word vectors. In addition, the StudyStack Flashcards should be downloaded and expanded into the same directory.

The models directory holds the models written out by the different training scripts when they run. The deploy code uses these saved models to make predictions. Models are not checked into GitHub because of space considerations.

Model deployment

For deployment, we run each question + answer pair through the model and take the difference between the positive and negative outputs as the "score". The scores are then normalized to sum to 1, and the choice with the highest score is selected as the winner, as sketched below.
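A sketch of that scoring step, assuming a trained Keras model whose two outputs are the negative and positive class probabilities for an encoded (question, answer) pair:

    import numpy as np

    def predict_choice(model, x_question, x_answers):
        # x_question is a batch-of-1 encoded question; x_answers holds the
        # encoded vectors for choices A through D
        scores = []
        for x_answer in x_answers:
            p_neg, p_pos = model.predict([x_question, x_answer])[0]
            scores.append(p_pos - p_neg)  # "score" = positive minus negative
        probs = np.array(scores) / np.sum(scores)  # normalize to sum to 1
        return "ABCD"[int(np.argmax(probs))], probs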

As an example, the following image shows the actual predictions for a question and answer from our dataset. The chart shows the probability of each choice according to our strongest model.

More Repositories

1. statlearning-notebooks - Python notebooks for exercises covered in Stanford statlearning class (where exercises were in R). (376 stars)
2. eeap-examples - Code for Document Similarity on Reuters dataset using Encode, Embed, Attend, Predict recipe. (Jupyter Notebook, 259 stars)
3. nltk-examples - Worked examples from the NLTK Book. (Python, 183 stars)
4. holiday-similarity - Finding similar images in the Holidays dataset. (Jupyter Notebook, 103 stars)
5. fttl-with-keras - Transfer Learning and Fine Tuning for Cross Domain Image Classification with Keras. (Jupyter Notebook, 83 stars)
6. ner-re-with-transformers-odsc2022 - Building NER and RE components using HuggingFace Transformers. (Jupyter Notebook, 47 stars)
7. hia-examples - Hadoop In Action examples. (Java, 39 stars)
8. mlia-examples - Python and R examples. (Python, 39 stars)
9. pytorch-gnn-tutorial-odsc2021 - Repository for GNN tutorial using PyTorch and PyTorch Geometric (PyG) for ODSC 2021. (Jupyter Notebook, 36 stars)
10. reuters-docsim - Different approaches to computing document similarity. (Python, 28 stars)
11. mia-scala-examples - Mahout examples. (Scala, 26 stars)
12. keras-tutorial-odsc2020 - Notebooks for Keras tutorial presented at ODSC West 2020. (Jupyter Notebook, 26 stars)
13. ltr-examples - Supporting code for Learning to Rank (LTR) presentation. (Jupyter Notebook, 16 stars)
14. polydlot - My attempt to learn more than one Deep Learning framework. (Jupyter Notebook, 16 stars)
15. intro-dl-talk-code - Jupyter notebooks and code for Intro to DL talk at Genesys. (Jupyter Notebook, 14 stars)
16. scalcium - Scala NLP algorithms. (Scala, 10 stars)
17. solr4-extras - Random Solr 4 customizations. (Scala, 10 stars)
18. delsym - An actor-based content ingestion pipeline. (Scala, 10 stars)
19. nlp-graph-examples - Examples for Graphorum 2019 presentation, Graph Techniques for Natural Language Processing. (Jupyter Notebook, 10 stars)
20. esc - Scala client for ElasticSearch. (Scala, 9 stars)
21. deeplearning-ai-examples - (Jupyter Notebook, 8 stars)
22. bpwj - Java parser development framework from Steven Metsker's "Building Parsers With Java" book. (Java, 8 stars)
23. thinkstats-examples - Worked examples for exercises in Think Stats using the scientific Python stack. (Jupyter Notebook, 8 stars)
24. saturn-scispacy - SaturnCloud notebooks to extract annotations from the CORD-19 dataset using SciSpacy pretrained models. (Jupyter Notebook, 8 stars)
25. llm-rag-eval - Large Language Model (LLM) powered evaluator for Retrieval Augmented Generation (RAG) pipelines. (Python, 8 stars)
26. content-engineering-tutorial - (Jupyter Notebook, 7 stars)
27. vespa-poc - Small proof of concept to familiarize myself with Vespa.ai functionality. (Python, 7 stars)
28. neural-re-experiments - (Jupyter Notebook, 5 stars)
29. kg-aligned-entity-linker - Knowledge Graph aligned entity linker using BERT and Sentence Transformers. (Jupyter Notebook, 5 stars)
30. bayesian-stats-examples - Python versions of things taught in the Bayesian Statistics courses on Coursera. (Jupyter Notebook, 4 stars)
31. neurips-papers-node2vec - (Jupyter Notebook, 3 stars)
32. tgni - Experimental NER techniques to address common (for me) text analysis problems. (Java, 3 stars)
33. snorkel-pytorch-lstm-gpu - Code for my GPU port of Snorkel's PyTorch discriminative model (LSTM). (Python, 2 stars)
34. compmethods-notebooks - Python notebooks for the Computational Methods for Data Analysis course on Coursera. (2 stars)
35. claimintel - Descriptive stats on claims data. (Scala, 1 star)
36. misc-docs - Account for storing miscellaneous text files for sharing. (1 star)
37. spark-data-algorithms - Implementations of common data algorithms in Spark. (1 star)
38. sherpa - Django-based web application to help with organizing a conference (summit). (Python, 1 star)
39. pytorch-drl-examples - Reimplementation of Deep Reinforcement Learning examples from "Deep Reinforcement Learning with Python" by Sudharsan Ravichandran. (1 star)