• Stars
    star
    108
  • Rank 321,259 (Top 7 %)
  • Language
    Python
  • Created over 6 years ago
  • Updated about 5 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

An extractive neural network text summarization library for the EMNLP 2018 paper "Content Selection in Deep Learning Models of Summarization" (https://arxiv.org/abs/1810.12343).

nnsum

An extractive neural network text summarization library for the EMNLP 2018 paper Content Selection in Deep Learning Models of Summarization (https://arxiv.org/abs/1810.12343).

Installation

  1. Install pytorch using pip or conda.
  2. run:
git clone https://github.com/kedz/nnsum.git
cd nnsum
python setup.py install
  1. Get the data:
git clone https://github.com/kedz/summarization-datasets.git
cd summarization-datasets
python setup.py install

See README.md in summarization-datasets for details on how to get each dataset from the paper.

Training A Model

All models from the paper can be trained from the same convenient training script: script_bin/train_model.py. The general pattern for usage is:

python script_bin/train_model.py \
  --trainer TRAINER_ARGS --emb EMBEDDING_ARGS \
  --enc ENCODER_ARGS --ext EXTRACTOR_ARGS

Every model has a set of word embeddings, a sentence encoder, and a sentence extractor. Each argument section allows you to pick an architecture/options for that component. For the most part, defaults match the paper's primary evaluation settings. For example, to train the CNN encoder with Seq2Seq extractor on gpu 0, run the following:

python script_bin/train_model.py \
    --trainer --train-inputs PATH/TO/INPUTS/TRAIN/DIR \
              --train-labels PATH/TO/LABELS/TRAIN/DIR \
              --valid-inputs PATH/TO/INPUTS/VALID/DIR \
              --valid-labels PATH/TO/LABELS/VALID/DIR \
              --valid-refs PATH/TO/HUMAN/REFERENCE/VALID/DIR \
              --weighted \
              --gpu 0 \
              --model PATH/TO/SAVE/MODEL \
              --results PATH/TO/SAVE/VALIDATION/SCORES \
              --seed 12345678 \
    --emb --pretrained-embeddings PATH/TO/200/DIM/GLOVE \
    --enc cnn \
    --ext s2s --bidirectional 

Trainer Arguments

These arguments set the data to train on, batch sizes, training epochs, etc.

  • --train-inputs PATH Path to training inputs directory. (required)
  • --train-labels PATH Path to training labels directory. (required)
  • --valid-inputs PATH Path to validation inputs directory. (required)
  • --valid-labels PATH Path to validation labels directory. (required)
  • --valid-refs PATH Path to validation reference summaries directory. (required)
  • --seed SEED Random seed for getting repeatable results. Due to randomness in cudnn drivers the CNN sentence encoder will still have randomness.
  • --epochs EPOCHS Number of training epochs to run. (Default: 50)
  • --batch-size BATCH_SIZE Batch size for sgd. (Default: 32)
  • --gpu GPU Select gpu device. When set to -1, use cpu. (Default: -1)
  • --teacher-forcing EPOCHS Number of epochs to use teacher forcing. After the EPOCHS training epoch, teacher forcing is disabled. Only effects the Cheng & Lapata extractor. (Default: 25)
  • --sentence-limit LIMIT Only load the first LIMIT sentences of each document. Useful for saving memory/compute time when some of the data contains outlier documents with long length. (Default: 50)
  • --weighted When this flag is used, upweight positive labels to make them proportional to the negative labels. This is helpful because this is an inbalanced classification problem and we only want to select 3 or 4 sentences out of 30-50 sentences.
  • --summary-length LENGTH Set the maximum summary word length for ROUGE evaluation.
  • --remove-stopwords When flag is set, ROUGE is computed with stopwords removed.
  • --model PATH Location to save model. (Optional)
  • --results PATH Location to save results json. Stores per epoch training cross entropy, and per epoch validation cross-entropy, rouge-1, and rouge-2. (optional)

Embedding Arguments

These arguments set the word embeddings size, the path to pretrained embeddings, whether to fix the embeddings during learning, etc.

  • --embedding-size SIZE Dimenstion of word embeddings. Must match the size of pretrained embeddings if using pretrained. (Default: 200)
  • --pretrained-embeddings PATH Path to pretrained embeddings. Embeddings file should be a text file each line in the format: word e1 e2 e3... where e1, e2, ... are the values of the embedding at that dimension. (optional)
  • --top-k TOP_K Only keep the TOP_K most frequent words in the embedding vocabulary where counts are based on the training dataset. Default keeps all words. Only active when not using pretrained embeddings. (Default: None)
  • --at-least AT_LEAST Keep only words that occur at least AT_LEAST times in the training dataset. Default keeps all words that occur at least once. Only active when not using pretrained embeddings. (Default: 1)
  • --word-dropout PROB Apply dropout to tokens in the input document with drop prob PROB. (Default: 0)
  • --embedding-dropout PROB Apply dropout to the word embeddings with drop prob PROB. (default: .25)
  • --update-rule RULE Options are update-all and fix-all. update-all treats the embeddings as parameters to be learned during training. fix-all prevents word embeddings from being updated. (Default: fix-all)
  • --filter-pretrained When flag is set, apply --top-k and --at-least filters to pretrained embedding vocabulary.

Encoder Arguments

The encoder arguments select for one of three sentence encoder architectures: avg, rnn, or cnn, and their various parameters e.g. --enc avg or --enc cnn CNN_ARGUMENTS. The sentence encoder takes a sentence, i.e. an arbitrarily long sequence of word embeddings and encodes them as a fixed length embedding. Below we describe the options for each architecture.

Averaging Encoder

A sentence embedding is simply the average of the word embeddings.

  • --dropout PROB Drop probability applied after averaging the word embeddings. (Default: .25)

RNN Encoder

A sentence embedding is the final output state of an RNN applied to the word embedding sequence.

  • --hidden-size SIZE Size of the RNN output layer (and hidden if using LSTM cell). (Default: 200)
  • --bidirectional When flag is set, use a bi-directional RNN. Output is the concatenation of the final forward and backward outputs.
  • --dropout PROB Drop propability applied to output layers of the RNN. (Default: .25)
  • --num-layers NUM Number layers in the RNN. (Default: 1)
  • --cell CELL RNN cell type to use. Options are rnn, gru, or lstm. (Default: gru)

CNN Encoder

A sentence embedding is the output of a convolutional + relu + dropout layer. The output size is the sum of the --feature-maps argument.

  • --dropout PROB Drop probability applied to convolutional layer output. (Default: .25)
  • --filter-windows WIN [WIN ...] Filter window widths, i.e. size in ngrams of each convolutional feature window. Number of args must be the same length as --feature-maps. (Default: 1 2 3 4 5 6)
  • --feature-maps MAPS [MAPS ...] Number of convolutional feature maps. Number of args must be the same length as --filter-windows. (Default: 25 25 50 50 50 50)

Extractor Arguments

The extractor arguments select for one of four sentence extractor architectures: rnn, s2s, cl, or sr, and their various parameters e.g. --ext cl or --enc s2s S2S_ARGUMENTS. The sentence extractor takes an arbitrarily long sequence of sentence embeddings and predicts whether each sentence should be included in the summary. Below we describe the options for each architecture.

RNN Extractor

Sentence embeddings are run through an RNN and then fed into a multi-layer perceptron MLP to predict sentence extraction.

  • --hidden-size SIZE Size of the RNN output layer/hidden layer. (Default: 300)
  • --bidirectional When flag is set, use a bi-directional RNN.
  • --rnn-dropout PROB Drop propability applied to output layers of the RNN. (Default: .25)
  • --num-layers NUM Number of layers in the RNN. (Default: 1)
  • --cell CELL RNN cell type to use. Options are rnn, gru, or lstm. (Default: gru)
  • --mlp-layers SIZE [SIZE ...] A list of sizes of the hidden layers in the MLP. Must be the same length as --mlp-dropouts. (Default: 100)
  • --mlp-dropouts PROB [PROBS ...] A list the MLP hidden layer dropout probabilities. Must be the same length as --mlp-layers. (Default: .25)

Seq2Seq Extractor

Sentence embeddings are run through a seq2seq based extractor with attention and MLP layer for predicting sentence extraction.

  • --hidden-size SIZE Size of the RNN output layer/hidden layer. (Default: 300)
  • --bidirectional When flag is set, use a bi-directional RNN.
  • --rnn-dropout PROB Drop propability applied to output layers of the RNN. (Default: .25)
  • --num-layers NUM Number of layers in the RNN. (Default: 1)
  • --cell CELL RNN cell type to use. Options are rnn, gru, or lstm. (Default: gru)
  • --mlp-layers SIZE [SIZE ...] A list of sizes of the hidden layers in the MLP. Must be the same length as --mlp-dropouts. (Default: 100)
  • --mlp-dropouts PROB [PROBS ...] A list the MLP hidden layer dropout probabilities. Must be the same length as --mlp-layers. (Default: .25)

Cheng & Lapata Extractor

This is an implementation of the sentence extractive summarizer from: https://arxiv.org/abs/1603.07252

  • --hidden-size SIZE Size of the RNN output layer/hidden layer. (Default: 300)
  • --bidirectional When flag is set, use a bi-directional RNN.
  • --rnn-dropout PROB Drop propability applied to output layers of the RNN. (Default: .25)
  • --num-layers NUM Number of layers in the RNN. (Default: 1)
  • --cell CELL RNN cell type to use. Options are rnn, gru, or lstm. (Default: gru)
  • --mlp-layers SIZE [SIZE ...] A list of sizes of the hidden layers in the MLP. Must be the same length as --mlp-dropouts. (Default: 100)
  • --mlp-dropouts PROB [PROBS ...] A list the MLP hidden layer dropout probabilities. Must be the same length as --mlp-layers. (Default: .25)

SummaRunner Extractor

This is an implementation of the sentence extractive summarizer from: https://arxiv.org/abs/1611.04230

  • --hidden-size SIZE Size of the RNN output layer/hidden layer. (Default: 300)
  • --bidirectional When flag is set, use a bi-directional RNN.
  • --rnn-dropout PROB Drop propability applied to output layers of the RNN. (Default: .25)
  • --num-layers NUM Number of layers in the RNN. (Default: 1)
  • --cell CELL RNN cell type to use. Options are rnn, gru, or lstm. (Default: gru)
  • --sentence-size SIZE Dimension of sentence representation (after RNN layer) (Default: 100)
  • --document-size SIZE Dimension of the document representation (Default: 100)
  • --segments SEG Number of coarse position chunks, e.g. 4 segments means sentences in first quarter of the document would get the same segment embedding, the second quarter and so on. (Default: 4)
  • --max-position-weights NUM The number of unique sentence position embeddings to use. The first NUM sentences will get unique sentence position embeddings. Documents longer than NUM will have the last sentence position embedding repeated. (Default: 50)
  • --segment-size SIZE Dimension of segment embeddings. (Default: 16)
  • --position-size SIZE Dimension of position embeddings. (Default: 16)

Evaluating a Model

To get a model's ROUGE scores on the train, validation, or test set use script_bin/eval_model.py.

E.g.:

python eval_model.py \
  --inputs PATH/TO/INPUTS/DIR \
  --refs PATH/TO/REFERENCE/DIR \
  --model PATH/TO/MODEL \
  --results PATH/TO/WRITE/RESULTS \
  --summary-length 100 

Eval script parameters are described below:

  • --batch-size BATCH_SIZE Batch size to use when generating summaries. (Default: 32)
  • --gpu GPU Which gpu device to use. When GPU == -1, use cpu. (Default: -1)
  • --sentence-limit LIMIT Only read in the first LIMIT sentences in the document to be summarized. By default there is no limit. (Default: None)
  • --summary-length LENGTH The summary word length to use in ROUGE evauation. (Default: 100)
  • --remove-stopwords When flag is true ROUGE evaluation is done with stopwords removed.
  • --inputs PATH Path to input data directory. (required)
  • --refs PATH Path to human reference summary directory. (required)
  • --model PATH Path to saved model to evaluate. (required)
  • --results PATH Path to write results json. Stores per document ROUGE scores and dataset average scores. (optional)

More Repositories

1

summarization-datasets

Pre-processing and in some cases downloading of datasets for the paper "Content Selection in Deep Learning Models of Summarization."
Python
78
star
2

sumpy

SUMPY: a python automatic text summarization library
Python
54
star
3

newsblaster

Group workspace for improvements to the Columbia Newsblaster system.
CSS
31
star
4

corenlp

A python wrapper for the Stanford CoreNLP java library.
Python
17
star
5

duc_preprocess

Library for pre-processing DUC summarization data.
Python
7
star
6

cuttsum

Columbia University - Trec 2015 - Temporal Summarization
Python
5
star
7

wtmf

Numpy implementation of Weiwei Guo's Weighted Textual Matrix Factorization in sklearn framework.
Python
5
star
8

noiseylg

Code repository for INLG 19 paper A Good Sample is Hard to Find: Noise Injection Sampling and Self-Training for Neural Language Generation Models. Arxiv paper: https://arxiv.org/abs/1911.03373
Python
4
star
9

learn2sum

Learning2search (LOLS) for multi-document summarization.
Python
3
star
10

discourse

OpenEdge ABL
2
star
11

styleeq

Code for Low-Level Linguistic Controls for Style Transfer and Content Preservation
Python
2
star
12

rouge_papier

A ROUGE wrapper in python.
Perl
2
star
13

spensum

Extractive text summarization using structured prediction energy networks.
Python
2
star
14

disasters

Python
1
star
15

LDAx10

Latent Dirichlet Allocation implemented in x10.
1
star
16

newsblaster-lite

An easier newsblaster starting point with fewer dependencies.
Python
1
star
17

NewsBlasterToolkit

1
star
18

nnmemory

Diferentiable data structures for torch
Lua
1
star
19

python-nlp

A place for some fast basic nlp implementations in c along with a python binding.
C
1
star
20

coherence

Python
1
star
21

thesis_proposal

My thesis proposal documents and slides.
TeX
1
star
22

EventQA

EventQA system originally developed by Shreya Prasad for the DARPA BOLT project at Columbia NLP lab.
Java
1
star