• Stars: 262
• Rank: 156,136 (Top 4%)
• Language: Python
• Created: about 6 years ago
• Updated: over 2 years ago


Repository Details

neural-template-gen

Code for "Learning Neural Templates for Text Generation" (Wiseman, Shieber, Rush; EMNLP 2018)

For questions/concerns/bugs please feel free to email swiseman[at]ttic.edu.

N.B. This code was tested with Python 2.7 and PyTorch 0.3.1.

Data and Data Preparation

The E2E NLG Challenge data is available here, and the preprocessed version of the data used for training is at data/e2e_aligned.tar.gz. This preprocessed data uses the same database record preprocessing scheme applied by Sebastian Gehrmann in his system, and also annotates text spans that occur in the corresponding database. Code for annotating the data in this way is at data/make_e2e_labedata.py.

The WikiBio data is available here, and the preprocessed version of the target-side data used for training is at data/wb_aligned.tar.gz. This target-side data is again preprocessed to annotate spans appearing in the corresponding database. Code for this annotation is at data/make_wikibio_labedata.py. The source-side data can be downloaded directly from the WikiBio repo, and we used it unchanged; in particular the *.box files become our src_*.txt files mentioned below.

The code assumes that each dataset lives in a directory containing src_train.txt, train.txt, src_valid.txt, and valid.txt files, and that the directory name contains the string wiki if the data comes from the WikiBio dataset.
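
As a quick sanity check of this layout, a minimal sketch is below; the data/labee2e/ and data/labewiki/ paths simply match the default directory names used in the commands that follow.

# Verify that each dataset directory contains the expected files.
import os

for dirname in ["data/labee2e", "data/labewiki"]:
    for fname in ["src_train.txt", "train.txt", "src_valid.txt", "valid.txt"]:
        path = os.path.join(dirname, fname)
        print("{}: {}".format(path, "found" if os.path.isfile(path) else "MISSING"))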

Training

The four trained models mentioned in the paper can be downloaded here. The commands for retraining the models are given below.

Assuming your E2E data is in data/labee2e/, you can train the non-autoregressive model as follows

python chsmm.py -data data/labee2e/ -emb_size 300 -hid_size 300 -layers 1 -K 55 -L 4 -log_interval 200 -thresh 9 -emb_drop -bsz 15 -max_seqlen 55 -lr 0.5 -sep_attn -max_pool -unif_lenps -one_rnn -Kmul 5 -mlpinp -onmt_decay -cuda -seed 1818 -save models/chsmm-e2e-300-55-5.pt

and the autoregressive model as follows.

python chsmm.py -data data/labee2e/ -emb_size 300 -hid_size 300 -layers 1 -K 55 -L 4 -log_interval 200 -thresh 9 -emb_drop -bsz 15 -max_seqlen 55 -lr 0.5 -sep_attn -max_pool -unif_lenps -one_rnn -Kmul 5 -mlpinp -onmt_decay -cuda -seed 1111 -save models/chsmm-e2e-300-55-5-far.pt -ar_after_decay

Assuming your WikiBio data is in data/labewiki, you can train the non-autoregressive model as follows

python chsmm.py -data data/labewiki/ -emb_size 300 -hid_size 300 -layers 1 -K 45 -L 4 -log_interval 1000 -thresh 29 -emb_drop -bsz 5 -max_seqlen 55 -lr 0.5 -sep_attn -max_pool -unif_lenps -one_rnn -Kmul 3 -mlpinp -onmt_decay -cuda -save models/chsmm-wiki-300-45-3.pt

and the autoregressive model as follows.

python chsmm.py -data data/labewiki/ -emb_size 300 -hid_size 300 -layers 1 -K 45 -L 4 -log_interval 1000 -thresh 29 -emb_drop -bsz 5 -max_seqlen 55 -lr 0.5 -sep_attn -max_pool -unif_lenps -one_rnn -Kmul 3 -mlpinp -onmt_decay -cuda -save models/chsmm-wiki-300-45-3-war.pt -ar_after_decay -word_ar

The above commands save model checkpoints to the models/ directory, which must be created first. Also see chsmm.py for additional training and model options.
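
The output directories referenced in this README (models/ here, plus the segs/ and gens/ directories used by the segmentation and generation commands below) are not created automatically. A minimal way to create them from the repository root:

# Create the output directories used by the commands in this README.
import os

for dirname in ["models", "segs", "gens"]:
    if not os.path.isdir(dirname):
        os.makedirs(dirname)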

N.B. Training is somewhat sensitive to the random seed, and it may be necessary to try several seeds in order to get the best performance.

Viterbi Segmentation/Template Extraction

Once you've trained a model, you can use it to compute the Viterbi segmentation of the training data, which we use to extract templates. A gzipped tarball containing Viterbi segmentations corresponding to the four models above can be downloaded here.

You can rerun the segmentation for the non-autoregressive E2E model as follows

python chsmm.py -data data/labee2e/ -emb_size 300 -hid_size 300 -layers 1 -K 55 -L 4 -log_interval 200 -thresh 9 -emb_drop -bsz 16 -max_seqlen 55 -lr 0.5  -sep_attn -max_pool -unif_lenps -one_rnn -Kmul 5 -mlpinp -onmt_decay -cuda -load models/e2e-55-5.pt -label_train | tee segs/seg-e2e-300-55-5.txt

and for the autoregressive one as follows.

python chsmm.py -data data/labee2e/ -emb_size 300 -hid_size 300 -layers 1 -K 60 -L 4 -log_interval 200 -thresh 9 -emb_drop -bsz 16 -max_seqlen 55 -lr 0.5  -sep_attn -max_pool -unif_lenps -one_rnn -Kmul 1 -mlpinp -onmt_decay -cuda -load models/e2e-60-1-far.pt -label_train -ar_after_decay | tee segs/seg-e2e-300-60-1-far.txt

You can rerun the segmentation for the non-autoregressive WikiBio model as follows

python chsmm.py -data data/labewiki/ -emb_size 300 -hid_size 300 -layers 1 -K 45 -L 4 -log_interval 200 -thresh 29 -emb_drop -bsz 16 -max_seqlen 55 -lr 0.5  -sep_attn -max_pool -unif_lenps -one_rnn -Kmul 3 -mlpinp -onmt_decay -cuda -load models/wb-45-3.pt -label_train | tee segs/seg-wb-300-45-3.txt

and for the autoregressive one as follows.

python chsmm.py -data data/labewiki/ -emb_size 300 -hid_size 300 -layers 1 -K 45 -L 4 -log_interval 200 -thresh 29 -emb_drop -bsz 16 -max_seqlen 55 -lr 0.5  -sep_attn -max_pool -unif_lenps -one_rnn -Kmul 3 -mlpinp -onmt_decay -cuda -load models/wb-45-3-war.pt -label_train | tee segs/seg-wb-300-45-3-war.txt

The above commands write the MAP segmentations (as text) to standard output, and tee additionally saves them to a segs/ directory.

Examining and Extracting Templates

The template_extraction.py script can be used to extract templates from the segmentations produced above and to examine them. In particular, extract_from_tagged_data() returns the most common templates, along with mappings from these templates to sentences and from states to phrases. This script is also used in generation (see below).

Generation

Once a model has been trained and the MAP segmentations have been created, we can generate by restricting decoding to (for instance) the top 100 extracted templates.

The following command will generate on the E2E validation set using the autoregressive model:

python chsmm.py -data data/labee2e/ -emb_size 300 -hid_size 300 -layers 1 -dropout 0.3 -K 60 -L 4 -log_interval 100 -thresh 9 -lr 0.5 -sep_attn -unif_lenps -emb_drop -mlpinp -onmt_decay -one_rnn -max_pool -gen_from_fi data/labee2e/src_uniq_valid.txt -load models/e2e-60-1-far.pt -tagged_fi segs/seg-e2e-60-1-far.txt -beamsz 5 -ntemplates 100 -gen_wts '1,1' -cuda -min_gen_tokes 0 > gens/gen-e2e-60-1-far.txt

The following command will generate on the WikiBio test set using the autoregressive model:

python chsmm.py -data data/labewiki/ -emb_size 300 -hid_size 300 -layers 1 -K 45 -L 4 -log_interval 1000 -thresh 29 -emb_drop -bsz 5 -max_seqlen 55 -lr 0.5 -sep_attn -max_pool -unif_lenps -one_rnn -Kmul 3 -mlpinp -onmt_decay -cuda -gen_from_fi wikipedia-biography-dataset/test/test.box -load models/wb-45-3-war.pt -tagged_fi segs/seg-wb-300-45-3-war.txt -beamsz 5 -ntemplates 100 -gen_wts '1,1' -cuda -min_gen_tokes 20 > gens/gen-wb-45-3-war.txt

Generations from the other models can be obtained analogously, by substituting in the correct arguments for -data (path to data directory), -gen_from_fi (the source file from which to generate), -load (path to the saved model), and -tagged_fi (path to the MAP segmentations under the corresponding model). See chsmm.py for additional generation options.

N.B. The format of the generations is: <generation>|||<segmentation>, where <segmentation> provides the segmentation used in generating. As such, all the text beginning with '|||' should be stripped off before evaluating the generations.
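
For example, a minimal sketch for stripping the segmentations before evaluation (the file names are illustrative):

# Keep only the text before '|||' on each line of a generations file.
with open("gens/gen-e2e-60-1-far.txt") as f_in, open("gens/gen-e2e-60-1-far.stripped.txt", "w") as f_out:
    for line in f_in:
        f_out.write(line.split("|||")[0].rstrip() + "\n")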

More Repositories

1. annotated-transformer: An annotated implementation of the Transformer paper. (Jupyter Notebook, 5,683 stars)
2. seq2seq-attn: Sequence-to-sequence model with LSTM encoder/decoders and attention (Lua, 1,257 stars)
3. im2markup: Neural model for converting Image-to-Markup (by Yuntian Deng, yuntiandeng.com) (Lua, 1,203 stars)
4. pytorch-struct: Fast, general, and tested differentiable structured prediction in PyTorch (Jupyter Notebook, 1,107 stars)
5. sent-conv-torch: Text classification using a convolutional neural network. (Lua, 448 stars)
6. namedtensor: Named Tensor implementation for Torch (Jupyter Notebook, 443 stars)
7. var-attn: Latent Alignment and Variational Attention (Python, 326 stars)
8. sent-summary (300 stars)
9. struct-attn: Code for Structured Attention Networks https://arxiv.org/abs/1702.00887 (Lua, 237 stars)
10. NeuralSteganography: STEGASURAS: STEGanography via Arithmetic coding and Strong neURAl modelS (Python, 183 stars)
11. urnng (Python, 176 stars)
12. botnet-detection: Topological botnet detection datasets and graph neural network applications (Python, 169 stars)
13. data2text (Lua, 158 stars)
14. sa-vae (Python, 154 stars)
15. compound-pcfg (Python, 127 stars)
16. cascaded-generation: Cascaded Text Generation with Markov Transformers (Python, 127 stars)
17. TextFlow (Python, 116 stars)
18. boxscore-data (HTML, 111 stars)
19. decomp-attn: Decomposable Attention Model for Sentence Pair Classification (from https://arxiv.org/abs/1606.01933) (Lua, 95 stars)
20. encoder-agnostic-adaptation: Encoder-Agnostic Adaptation for Conditional Language Generation (Python, 79 stars)
21. genbmm: CUDA kernels for generalized matrix-multiplication in PyTorch (Jupyter Notebook, 79 stars)
22. DeepLatentNLP (61 stars)
23. nmt-android: Neural Machine Translation on Android (Lua, 59 stars)
24. BSO (Lua, 54 stars)
25. hmm-lm (Python, 42 stars)
26. seq2seq-talk (TeX, 39 stars)
27. Talk-Latent (TeX, 31 stars)
28. regulatory-prediction: Code and Data to accompany "Dilated Convolutions for Modeling Long-Distance Genomic Dependencies", presented at the ICML 2017 Workshop on Computational Biology (Python, 28 stars)
29. harvardnlp.github.io (JavaScript, 26 stars)
30. strux (Python, 18 stars)
31. lie-access-memory (Lua, 17 stars)
32. annotated-attention (Jupyter Notebook, 15 stars)
33. DataModules: A state-less module system for torch-like languages (Python, 8 stars)
34. rush-nlp (JavaScript, 8 stars)
35. seq2seq-attn-web (CSS, 8 stars)
36. tutorial-deep-latent (TeX, 7 stars)
37. MemN2N: Torch implementation of End-to-End Memory Networks (https://arxiv.org/abs/1503.08895) (Lua, 6 stars)
38. image-extraction: Extract images from PDFs (Jupyter Notebook, 4 stars)
39. paper-explorer (JavaScript, 3 stars)
40. readcomp: Entity Tracking Improves Cloze-style Reading Comprehension (Python, 3 stars)
41. banded: Sparse banded diagonal matrices for pytorch (Cuda, 2 stars)
42. torax (Python, 2 stars)
43. cs6741 (HTML, 2 stars)
44. simple-recs (Python, 1 star)
45. poser (Python, 1 star)
46. iclr (1 star)
47. cs6741-materials (1 star)