Cascaded Text Generation with Markov Transformers

This repository contains the code needed to reproduce our results, along with all training data and training scripts, as well as all pretrained models used in our paper together with their generation logs. Our code is built on top of fairseq and pytorch-struct.

Prerequisites

pip install -qU git+https://github.com/harvardnlp/pytorch-struct
pip install -qU git+https://github.com/harvardnlp/genbmm
pip install -q matplotlib
pip install -q sacremoses
pip install --editable .

Datasets & Pretrained Models & Logs

Only IWSLT14 De-En is included in this repository; other datasets and models can be found at this link. Data, models, and validation/test logs for the individual datasets are available at the links below.

Usage

Data Preprocessing

Throughout this README, we use IWSLT14 De-En as an example to show how to reproduce our results. First, we need to estimate the mapping from source length to target length. We simply use linear regression here: target_length = max-len-a * source_length + max-len-b, where max-len-a and max-len-b are estimated from the training data. Note that directly using max-len-a=1 and max-len-b=0 would still achieve reasonable performance.

python scripts/get_max_len_ab.py data/iwslt14-de-en/train.de data/iwslt14-de-en/train.en

This yields max-len-a = 0.941281036889224 and max-len-b = 0.8804326732522796.
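For reference, below is a minimal sketch of how such a length mapping could be estimated with ordinary least squares. This is a hypothetical stand-alone example that assumes whitespace-tokenized parallel files; the actual scripts/get_max_len_ab.py in this repository may differ in its details.

import sys

import numpy as np

def fit_length_map(src_path, tgt_path):
    """Fit target_length ~= a * source_length + b by least squares; return (a, b)."""
    src_lens, tgt_lens = [], []
    with open(src_path) as src_file, open(tgt_path) as tgt_file:
        for src_line, tgt_line in zip(src_file, tgt_file):
            src_lens.append(len(src_line.split()))  # whitespace tokenization assumed
            tgt_lens.append(len(tgt_line.split()))
    # polyfit with deg=1 returns the slope (max-len-a) and the intercept (max-len-b)
    a, b = np.polyfit(np.array(src_lens, dtype=float), np.array(tgt_lens, dtype=float), deg=1)
    return a, b

if __name__ == "__main__":
    a, b = fit_length_map(sys.argv[1], sys.argv[2])
    print(f"max-len-a = {a}, max-len-b = {b}")

If the repository's script uses the same tokenization, running this on train.de and train.en should produce values close to those reported above.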

Before training our model, we need to preprocess the training data using fairseq-preprocess.

DATASET=iwslt14-de-en
SOURCE_LANG=de
TARGET_LANG=en
TEXT=data/$DATASET
DATA_BIN=data-bin/$DATASET
fairseq-preprocess --source-lang $SOURCE_LANG --target-lang $TARGET_LANG \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir $DATA_BIN \
    --workers 20

Training

DATASET=iwslt14-de-en
DATA_BIN=data-bin/$DATASET
SAVE_DIR=checkpoints/$DATASET
ARCH=transformer_iwslt_de_en
DROPOUT=0.3
MAX_TOKENS=4096
LR=5e-4
WARMUP_UPDATES=4000
MAX_UPDATES=120000
WEIGHT_DECAY=0.0001
MAX_LEN_A=0.941281036889224
MAX_LEN_B=0.8804326732522796
CUDA_VISIBLE_DEVICES=0 fairseq-train $DATA_BIN --arch $ARCH --share-decoder-input-output-embed \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr $LR --lr-scheduler inverse_sqrt \
    --warmup-updates $WARMUP_UPDATES --dropout $DROPOUT --weight-decay $WEIGHT_DECAY \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 --max-tokens $MAX_TOKENS \
    --eval-bleu --eval-bleu-args '{"max_len_a": '$MAX_LEN_A', "max_len_b": '$MAX_LEN_B'}' \
    --eval-tokenized-bleu --eval-bleu-remove-bpe --eval-bleu-print-samples \
    --best-checkpoint-metric bleu --maximize-best-checkpoint-metric --save-dir $SAVE_DIR \
    --max-update $MAX_UPDATES --validation-max-size 3000 \
    --validation-topk 16 --validation-D 3 --validation-rounds 5 --seed 1234 --ngrams 5

We use a single GPU to train on IWSLT14 De-En. After training is done, we can use checkpoints/iwslt14-de-en/checkpoint_best.pt for generation.

Generation

As an example, let's generate from the model trained above, using topk = 32 and rounds = 5.

DATASET=iwslt14-de-en
DATA_BIN=data-bin/$DATASET
SAVE_DIR=checkpoints/$DATASET
BATCH_SIZE=1
TOPK=32
rounds=5
MAX_LEN_A=0.941281036889224
MAX_LEN_B=0.8804326732522796
CUDA_VISIBLE_DEVICES=0 fairseq-generate $DATA_BIN --path $SAVE_DIR/checkpoint_best.pt \
    --batch-size $BATCH_SIZE --topk $TOPK --remove-bpe --D 3 --rounds $rounds \
    --max-len-a $MAX_LEN_A --max-len-b $MAX_LEN_B

Note that using a model trained on a different dataset requires re-estimating max-len-a and max-len-b. We also provide another version of the max-marginal computation using CUDA kernels based on tvm, which might be slightly faster depending on the platform. To use it, install tvm first, then add --usetvm to the above command.

Multi-GPU Generation

Our approach is amenable to multi-GPU parallelization: we can get a further speedup even at batch size 1 by using multiple GPUs.

NGPUS=3
DATASET=iwslt14-de-en
DATA_BIN=data-bin/$DATASET
SAVE_DIR=checkpoints/$DATASET
TOPK=32
rounds=5
MAX_LEN_A=0.941281036889224
MAX_LEN_B=0.8804326732522796
CUDA_VISIBLE_DEVICES=0,1,2 fairseq-generate $DATA_BIN --path $SAVE_DIR/checkpoint_best.pt \
    --batch-size 1 --topk $TOPK --remove-bpe --D 3 --rounds $rounds --ngpus $NGPUS \
    --max-len-a $MAX_LEN_A --max-len-b $MAX_LEN_B

Visualizations

More visualizations can be found in analysis/visualizations. To draw these plots (Figures 1, 4, and 5 in the paper), we need a few additional dependencies.

pip install -qU git+https://github.com/da03/matplotlib.git
pip install -q mplot3d-dragger
pip install -q tqdm
apt install imagemagick

In particular, we need to use a slightly modified version of matplotlib to remove the hard-coded padding for axes.

First, we need to dump the data for visualization by running the generation command with --dump-vis-path dump_vis_path (here we use 10 examples from the IWSLT14 De-En validation set and topk=10):

DATASET=iwslt14-de-en
DATA_BIN=data-bin/$DATASET
SAVE_DIR=checkpoints/$DATASET
BATCH_SIZE=1
TOPK=10
rounds=5
MAX_LEN_A=0.941281036889224
MAX_LEN_B=0.8804326732522796
CUDA_VISIBLE_DEVICES=0 fairseq-generate $DATA_BIN --path $SAVE_DIR/checkpoint_best.pt \
    --batch-size $BATCH_SIZE --topk $TOPK --remove-bpe --D 3 --rounds $rounds \
    --max-len-a $MAX_LEN_A --max-len-b $MAX_LEN_B \
    --gen-subset valid --max-size 10 --seed 1234 --dump-vis-path analysis/data/dump_iwslt14_de_en_val_max10_topk10.pt

Next, we can use the following command to generate the images and animations:

python analysis/plots/visualize_3d.py --dump-vis-path analysis/data/dump_iwslt14_de_en_val_max10_topk10.pt --output-dir analysis/visualizations

We can also use the command below to print out the constraint sets during decoding:

python analysis/plots/print_constraints.py --dump-vis-path analysis/data/dump_iwslt14_de_en_val_max10_topk10.pt --output-dir analysis/visualizations

Training on Other Datasets

WMT14 (raw/distilled) En-De/De-En

For preprocessing, we need to use a joined dictionary.

DATASET=? # dataset dependent
SOURCE_LANG=? # dataset dependent
TARGET_LANG=? # dataset dependent
TEXT=data/$DATASET
DATA_BIN=data-bin/$DATASET
fairseq-preprocess \
    --source-lang $SOURCE_LANG --target-lang $TARGET_LANG \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir $DATA_BIN --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20 --joined-dictionary

We train on 3 GPUs.

DATASET=? # dataset dependent
MAX_LEN_A=? # dataset dependent
MAX_LEN_B=? # dataset dependent
DATA_BIN=data-bin/$DATASET
SAVE_DIR=checkpoints/$DATASET
ARCH=transformer_wmt_en_de
DROPOUT=0.1
MAX_TOKENS=4096
LR=7e-4
WARMUP_UPDATES=4000
MAX_UPDATES=240000
WEIGHT_DECAY=0.0
CUDA_VISIBLE_DEVICES=0,1,2 python train.py $DATA_BIN --arch $ARCH --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 \
    --warmup-updates $WARMUP_UPDATES --lr $LR --min-lr 1e-09 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --weight-decay $WEIGHT_DECAY --max-tokens $MAX_TOKENS --save-dir $SAVE_DIR --update-freq 3 \
    --no-progress-bar --log-format json --log-interval 50 --save-interval-updates 1000 --dropout $DROPOUT \
    --fp16 --ddp-backend=no_c10d --eval-bleu --eval-bleu-args '{"max_len_a": '$MAX_LEN_A', "max_len_b": '$MAX_LEN_B'}' \
    --eval-tokenized-bleu --eval-bleu-remove-bpe --eval-bleu-print-samples  --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --max-update $MAX_UPDATES  --validation-max-size 3000 --validation-topk 16 --validation-D 3 --validation-rounds 5 --seed 1234

WMT16 (raw/distilled) En-Ro/Ro-En

For preprocessing, we need to use a joined dictionary.

DATASET=? # dataset dependent
SOURCE_LANG=? # dataset dependent
TARGET_LANG=? # dataset dependent
TEXT=data/$DATASET
DATA_BIN=data-bin/$DATASET
fairseq-preprocess \
    --source-lang $SOURCE_LANG --target-lang $TARGET_LANG \
    --trainpref $TEXT/train --validpref $TEXT/valid --testpref $TEXT/test \
    --destdir $DATA_BIN --thresholdtgt 0 --thresholdsrc 0 \
    --workers 20 --joined-dictionary

We train on 3 GPUs.

DATASET=? # dataset dependent
MAX_LEN_A=? # dataset dependent
MAX_LEN_B=? # dataset dependent
DATA_BIN=data-bin/$DATASET
SAVE_DIR=checkpoints/$DATASET
ARCH=transformer_wmt_en_de
DROPOUT=0.3
MAX_TOKENS=5461
LR=7e-4
WARMUP_UPDATES=10000
MAX_UPDATES=120000
WEIGHT_DECAY=0.01
CUDA_VISIBLE_DEVICES=0,1,2 python train.py $DATA_BIN --arch $ARCH --share-all-embeddings \
    --optimizer adam --adam-betas '(0.9, 0.98)' --clip-norm 0.0 --lr-scheduler inverse_sqrt --warmup-init-lr 1e-07 \
    --warmup-updates $WARMUP_UPDATES --lr $LR --min-lr 1e-09 --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --weight-decay $WEIGHT_DECAY --max-tokens $MAX_TOKENS --save-dir $SAVE_DIR --update-freq 1 \
    --no-progress-bar --log-format json --log-interval 50 --save-interval-updates 1000 --dropout $DROPOUT \
    --fp16 --ddp-backend=no_c10d --eval-bleu --eval-bleu-args '{"max_len_a": '$MAX_LEN_A', "max_len_b": '$MAX_LEN_B'}' \
    --eval-tokenized-bleu --eval-bleu-remove-bpe --eval-bleu-print-samples --best-checkpoint-metric bleu --maximize-best-checkpoint-metric \
    --max-update $MAX_UPDATES --validation-max-size 3000 --validation-topk 16 --validation-D 3 --validation-rounds 5 --seed 1234

Citation

@inproceedings{NEURIPS2020_01a06836,
 author = {Deng, Yuntian and Rush, Alexander},
 booktitle = {Advances in Neural Information Processing Systems},
 editor = {H. Larochelle and M. Ranzato and R. Hadsell and M.F. Balcan and H. Lin},
 pages = {170--181},
 publisher = {Curran Associates, Inc.},
 title = {Cascaded Text Generation with Markov Transformers},
 url = {https://proceedings.neurips.cc/paper/2020/file/01a0683665f38d8e5e567b3b15ca98bf-Paper.pdf},
 volume = {33},
 year = {2020}
}
