data2text

Code for Challenges in Data-to-Document Generation (Wiseman, Shieber, Rush; EMNLP 2017); much of this code is adapted from an earlier fork of OpenNMT.

The boxscore-data associated with the above paper can be downloaded from the boxscore-data repo. This README covers running experiments on the RotoWire portion of the data; running on the SBNation data (or other data) is quite similar.

Update 2: For an improved implementation of the extractive evaluation metrics (and improved models), please see the data2text-plan-py repo associated with the Puduppully et al. (AAAI 2019) paper.

Update: models and results reflecting the newly cleaned up data in the boxscore-data repo are now given below.

Preprocessing

Before training models, you must preprocess the data. Assuming the RotoWire JSON files reside at ~/Documents/code/boxscore-data/rotowire, the following command will preprocess the data

th box_preprocess.lua -json_data_dir ~/Documents/code/boxscore-data/rotowire -save_data roto

and write files called roto-train.t7, roto.src.dict, and roto.tgt.dict to your local directory.
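
As a quick sanity check (this script is illustrative and not part of the repo), you can confirm that the three output files named above were written:

# check_preprocess.py -- illustrative sanity check, not part of the repo.
# Confirms the three preprocessing outputs were written to the local directory.
import os

for fname in ["roto-train.t7", "roto.src.dict", "roto.tgt.dict"]:
    status = "ok" if os.path.exists(fname) else "MISSING"
    print(f"{fname}: {status}")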

Incorporating Pointer Information

For the "conditional copy" model, it is necessary to know where in the source table each target word may have been copied from.

This pointer information can be incorporated into the preprocessing by running:

th box_preprocess.lua -json_data_dir ~/Documents/code/boxscore-data/rotowire -save_data roto -ptr_fi "roto-ptrs.txt"

The file roto-ptrs.txt has been included in the repo.
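
For intuition, building this pointer file amounts to matching target-side tokens against the values in the source records. The sketch below illustrates the idea only; the function name and the (entity, type, value) record format are assumptions for illustration, and data_utils.py's actual matching rules and output format may differ.

# Illustrative sketch of copy-pointer extraction. The (entity, type, value)
# record format is an assumption for illustration, not the repo's actual format.
def find_pointers(records, target_tokens):
    # Index source records by their value string.
    value_to_idx = {}
    for i, (_, _, value) in enumerate(records):
        value_to_idx.setdefault(value, []).append(i)
    pointers = []
    for j, tok in enumerate(target_tokens):
        for i in value_to_idx.get(tok, []):
            pointers.append((j, i))  # target position j may copy source record i
    return pointers

print(find_pointers([("Jeremy_Lin", "PTS", "26")], ["Lin", "scored", "26", "points"]))
# -> [(2, 0)]: the token "26" may have been copied from record 0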

Training (and Downloading Trained Models)

The command for training the Joint Copy + Rec + TVD model is as follows:

th box_train.lua -data roto-train.t7 -save_model roto_jc_rec_tvd -rnn_size 600 -word_vec_size 600 -enc_emb_size 600 -max_batch_size 16 -dropout 0.5 -feat_merge concat -pool mean -enc_layers 1 -enc_relu -report_every 50 -gpuid 1 -epochs 50 -learning_rate 1 -enc_dropout 0 -decay_update2 -layers 2 -copy_generate -tanh_query -max_bptt 100 -discrec -rho 1 -partition_feats -recembsize 600 -discdist 1 -seed 0

A model trained in this way can be downloaded from https://drive.google.com/file/d/0B1ytQXPDuw7ONlZOQ2R3UWxmZ2s/view?usp=sharing

An updated model can be downloaded from https://drive.google.com/drive/folders/1QKudbCwFuj1BAhpY58JstyGLZXvZ-2w-?usp=sharing

The command for training the Conditional Copy model is as follows:

th box_train.lua -data roto-train.t7 -save_model roto_cc -rnn_size 600 -word_vec_size 600 -enc_emb_size 600 -max_batch_size 16 -dropout 0.5 -feat_merge concat -pool mean -enc_layers 1 -enc_relu -report_every 50 -gpuid 1 -epochs 100 -learning_rate 1 -enc_dropout 0 -decay_update2 -layers 2 -copy_generate -tanh_query -max_bptt 100 -switch -multilabel -seed 0

A model trained in this way can be downloaded from https://drive.google.com/file/d/0B1ytQXPDuw7OaHZJZjVWd2N6R2M/view?usp=sharing

An updated model can be downloaded from https://drive.google.com/drive/folders/1QKudbCwFuj1BAhpY58JstyGLZXvZ-2w-?usp=sharing

Generation

Use the following commands to generate from the above models:

th box_train.lua -data roto-train.t7 -save_model roto_jc_rec_tvd -rnn_size 600 -word_vec_size 600 -enc_emb_size 600 -max_batch_size 16 -dropout 0.5 -feat_merge concat -pool mean -enc_layers 1 -enc_relu -report_every 50 -gpuid 1 -epochs 50 -learning_rate 1 -enc_dropout 0 -decay_update2 -layers 2 -copy_generate -tanh_query -max_bptt 100 -discrec -rho 1 -partition_feats -recembsize 600 -discdist 1 -train_from roto_jc_rec_tvd_epoch45_7.22.t7 -just_gen -beam_size 5 -gen_file roto_jc_rec_tvd-beam5_gens.txt
th box_train.lua -data roto-train.t7 -save_model roto_cc -rnn_size 600 -word_vec_size 600 -enc_emb_size 600 -max_batch_size 16 -dropout 0.5 -feat_merge concat -pool mean -enc_layers 1 -enc_relu -report_every 50 -gpuid 1 -epochs 100 -learning_rate 1 -enc_dropout 0 -decay_update2 -layers 2 -copy_generate -tanh_query -max_bptt 100 -switch -multilabel -train_from roto_cc_epoch34_7.44.t7 -just_gen -beam_size 5 -gen_file roto_cc-beam5_gens.txt

The beam size used in generation can be adjusted with the -beam_size argument. You can generate on the test data by supplying the -test flag.
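
Generated summaries are written one per line to the file passed via -gen_file. A quick way to inspect the first one (a usage sketch, assuming the conditional copy generation above has been run):

# Print the start of the first generated summary; generation writes one
# summary per line to the file given by -gen_file.
with open("roto_cc-beam5_gens.txt") as f:
    first = f.readline().strip()
print(first[:200])  # first 200 characters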

Misc/Utils

You can regenerate a pointer file with:

python data_utils.py -mode ptrs -input_path ~/Documents/code/boxscore-data/rotowire/train.json -output_fi "my-roto-ptrs.txt"

Information/Relation Extraction

Creating Training/Validation Data

You can create a dataset for training or evaluating the relation extraction system as follows:

python data_utils.py -mode make_ie_data -input_path "../boxscore-data/rotowire" -output_fi "roto-ie.h5"

This will create files roto-ie.h5, roto-ie.dict, and roto-ie.labels.
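
To check what ended up in roto-ie.h5, you can enumerate its datasets with h5py (a sanity-check sketch; it prints whatever the file contains rather than assuming particular dataset names):

# List the datasets inside roto-ie.h5 (requires the h5py package).
# Dataset names are whatever make_ie_data wrote; we just enumerate them.
import h5py

with h5py.File("roto-ie.h5", "r") as f:
    for name, obj in f.items():
        if isinstance(obj, h5py.Dataset):
            print(name, obj.shape, obj.dtype)
        else:
            print(name, "(group)")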

Evaluating Generated Summaries

  1. You can download the extraction models we ensemble to do the evaluation from this link. There are six models in total, with the name pattern *ie-ep*.t7. Put these extraction models in the same directory as extractor.lua. (Note that extractor.lua hard-codes the paths to these saved models, so you'll need to change this if you want to substitute in new models.)

Updated extraction models can be downloaded from https://drive.google.com/drive/folders/1QKudbCwFuj1BAhpY58JstyGLZXvZ-2w-?usp=sharing

  2. Once you've generated summaries, you can put them into a format the extraction system can consume as follows:
python data_utils.py -mode prep_gen_data -gen_fi roto_cc-beam5_gens.txt -dict_pfx "roto-ie" -output_fi roto_cc-beam5_gens.h5 -input_path "../boxscore-data/rotowire"

where the file you've generated is called roto_cc-beam5_gens.txt and the dictionary and labels files are in roto-ie.dict and roto-ie.labels respectively (as above). This will create a file called roto_cc-beam5_gens.h5, which can be consumed by the extraction system.

  3. The extraction system can then be run as follows:
th extractor.lua -gpuid 1 -datafile roto-ie.h5 -preddata roto_cc-beam5_gens.h5 -dict_pfx "roto-ie" -just_eval

This will print out the RG metric numbers. (For the recall number, divide the 'nodup correct' number by the total number of generated summaries, e.g., 727.) It will also generate a file called roto_cc-beam5_gens.h5-tuples.txt containing the extracted relations, which can be compared to the gold extracted relations.
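
As a worked example of that division (both numbers below are hypothetical placeholders, not results from the paper):

# Recall computation described above, with placeholder values.
nodup_correct = 11600    # 'nodup correct' count printed by extractor.lua (hypothetical)
num_summaries = 727      # total number of generated summaries
print(nodup_correct / num_summaries)  # ~15.96, the recall number described above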

  4. We now need the tuples from the gold summaries. roto-gold-val.h5-tuples.txt and roto-gold-test.h5-tuples.txt have been included in the repo, but they can be recreated by repeating steps 2 and 3 using the gold summaries (with one gold summary per line, as usual).

  5. The remaining metrics can now be obtained by running:

python non_rg_metrics.py roto-gold-val.h5-tuples.txt roto_cc-beam5_gens.h5-tuples.txt
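
For reference, the content-selection (CS) numbers are set-style precision/recall between the tuples extracted from the generated summaries and those extracted from the gold summaries. A minimal sketch of that computation (non_rg_metrics.py's exact matching rules, e.g. duplicate handling, may differ):

# Minimal sketch of content-selection precision/recall over extracted tuples.
def cs_precision_recall(pred_tuples, gold_tuples):
    pred, gold = set(pred_tuples), set(gold_tuples)
    correct = len(pred & gold)
    return correct / len(pred), correct / len(gold)

p, r = cs_precision_recall({("Lin", "PTS", "26")},
                           {("Lin", "PTS", "26"), ("Lin", "AST", "5")})
print(p, r)  # 1.0 0.5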

Retraining the Extraction Model

I trained the convolutional IE model as follows:

th extractor.lua -gpuid 1 -datafile roto-ie.h5 -lr 0.7 -embed_size 200 -conv_fc_layer_size 500 -dropout 0.5 -savefile roto-convie

I trained the BLSTM IE model as follows:

th extractor.lua -gpuid 1 -datafile roto-ie.h5 -lstm -lr 1 -embed_size 200 -blstm_fc_layer_size 700 -dropout 0.5 -savefile roto-blstmie -seed 1111

The saved models linked to above were obtained by varying the seed or the epoch.

Updated Results

On the development set:

Model                 RG (P% / #)      CS (P% / R%)     CO      PPL    BLEU
Gold                  95.98 / 16.93    100 / 100        100     1      100
Template              99.93 / 54.21    23.42 / 72.62    11.30   N/A    8.97
Joint+Rec+TVD (B=1)   61.23 / 15.27    28.79 / 39.80    15.27   7.26   12.69
Conditional (B=1)     76.66 / 12.88    37.98 / 35.46    16.70   7.29   13.60
Joint+Rec+TVD (B=5)   62.84 / 16.77    27.23 / 40.60    14.47   7.26   13.44
Conditional (B=5)     75.74 / 16.93    31.20 / 38.94    14.98   7.29   14.57

On the test set:

Model                 RG (P% / #)      CS (P% / R%)     CO      PPL    BLEU
Gold                  96.11 / 17.31    100 / 100        100     1      100
Template              99.95 / 54.15    23.74 / 72.36    11.68   N/A    8.93
Joint+Rec+TVD (B=5)   62.66 / 16.82    27.60 / 40.59    14.57   7.49   13.61
Conditional (B=5)     75.62 / 16.83    32.80 / 39.93    15.62   7.53   14.19
