• Stars: 146
• Rank: 251,541 (Top 5%)
• Language: Perl
• License: Other
• Created: over 6 years ago
• Updated: about 1 year ago

Repository Details

Tools for downloading and analyzing summaries and evaluating summarization systems. https://summari.es/

Installation Instructions

Newsroom requires Python 3.6+ and can be installed using pip:

pip install -e git+git://github.com/clic-lab/newsroom.git#egg=newsroom
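
Once the package is installed, the command line tools described below should be available on your path; each has a --help page:

newsroom-scrape --help
newsroom-extract --help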

Getting the Data

There are two ways to obtain the summaries dataset. You may use the scripts described below to scrape the web pages used in the dataset and extract the summaries. Alternatively, the complete dataset is available from https://summari.es/download/.

Data Processing Tools

Newsroom contains two scripts for downloading and processing data from Archive.org. First, download the "Thin Dataset" from https://summari.es/download/. (The "Data Builder" listed there is this Python package.) Extract thin.tar with tar xvf thin.tar, which yields a thin directory containing several .jsonl.gz files.
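
For example, assuming thin.tar has been downloaded to the working directory:

tar xvf thin.tar
ls thin    # train.jsonl.gz, dev.jsonl.gz, test.jsonl.gz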

Next, use newsroom-scrape and newsroom-extract to process the data, as described below. Both of these tools have additional argument help pages when you use the --help command line option.

Data Scraping

The thin directory will contain three files: train.jsonl.gz, dev.jsonl.gz, and test.jsonl.gz. To begin downloading the development set from Archive.org, run the following:

newsroom-scrape --thin thin/dev.jsonl.gz --archive dev.archive

Estimated download time is indicated with a progress bar. If errors occur during downloading, you may need to re-run the script later to capture the missing articles. This process is network-bound and depends mostly on Archive.org; save your CPU cycles for the extraction stage!

The downloading process can be stopped at any time with Control-C and resumed later. It is also possible to perform extraction of a partially downloaded dataset with newsroom-extract before continuing to download the full version.

Data Extraction

The newsroom-extract tool extracts summaries and article text from the data downloaded by newsroom-scrape. It writes its output to a new file and does not modify the original output of newsroom-scrape. Run it with:

newsroom-extract --archive dev.archive --dataset dev.dataset

The script automatically parallelizes extraction across your CPU cores. To disable this or reduce the number of cores used, use the --workers option. Like scraping, the extraction process can be stopped at any point with Control-C and resumed later.
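
For example, to cap extraction at four worker processes (the count here is illustrative; check --help for the exact usage of the option):

newsroom-extract --archive dev.archive --dataset dev.dataset --workers 4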

Reading and Analyzing the Data

All data are represented using gzip-compressed JSON lines. The Newsroom package provides a simple tool for reading and writing these files, up to 20x faster than the standard Python gzip and json packages!

from newsroom import jsonl

# Read entire file:

with jsonl.open("train.dataset", gzip = True) as train_file:
    train = train_file.read()

# Read file entry by entry:

with jsonl.open("train.dataset", gzip = True) as train_file:
    for entry in train_file:
        print(entry["summary"], entry["text"])

Extraction Analysis

The Newsroom package also contains scripts for identifying extractive fragments and computing metrics described in the paper: coverage, density, and compression.

import random

from newsroom import jsonl
from newsroom.analyze import Fragments

with jsonl.open("train.dataset", gzip = True) as train_file:
    train = train_file.read()

# Compute stats on random training example:

entry = random.choice(train)
summary, text = entry["summary"], entry["text"]
fragments = Fragments(summary, text)

# Print paper metrics:

print("Coverage:",    fragments.coverage())
print("Density:",     fragments.density())
print("Compression:", fragments.compression())

# Extractive fragments oracle:

print("List of extractive fragments:")
print(fragments.strings())
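
For intuition, the three metrics can be related to the extractive fragments directly. The following is a rough sketch only, assuming simple whitespace tokenization (the package's own tokenization may differ), not a re-implementation of the library:

# Illustrative only: approximate the paper's metrics from the fragments
# above, assuming whitespace tokenization.
summary_length = len(summary.split())
text_length = len(text.split())
fragment_lengths = [len(f.split()) for f in fragments.strings()]

# Coverage: fraction of summary words that fall inside extractive fragments.
print("Approx. coverage:", sum(fragment_lengths) / summary_length)

# Density: mean squared fragment length per summary word.
print("Approx. density:", sum(length ** 2 for length in fragment_lengths) / summary_length)

# Compression: article length divided by summary length.
print("Approx. compression:", text_length / summary_length)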

Evaluation Tools

The Newsroom package contains a standardized way to run and score Docker-based summarization systems. See the /example directory for a Docker image of the TextRank system used in the paper.

The package also contains a script for producing tables similar to those in the paper for compression, coverage, and density. These tables are helpful for understanding how your system performs on text-summary pairs of varying difficulty.

Running Your System

After starting Docker and building your image (named "textrank" in the following examples), the system can be evaluated using the script:

newsroom-run \
    --system textrank \              # Name of Docker image.
    --dataset dev.dataset \          # Path to evaluation data.
    --summaries textrank.summaries \ # Output path to write system summaries.
    --keys text                      # JSON keys to feed Docker system.

The script runs your system's Docker image, passes article text (and any other requested metadata) to the container on standard input, and expects summaries on standard output.
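
As a rough illustration of this interface, a system inside the container might be structured like the sketch below. This is only an assumption-laden sketch (one JSON object per input line, one plain-text summary per output line); the authoritative reference for the actual protocol is the TextRank image in /example.

import json
import sys

# Hypothetical system loop (protocol assumed; see /example for the real thing):
# read one JSON object per line from stdin, emit one summary per line on stdout.
for line in sys.stdin:
    article = json.loads(line)
    text = article["text"]
    # Placeholder "summarizer": take the first sentence of the article.
    summary = text.split(".")[0].strip() + "."
    print(summary, flush = True)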

Scoring Your System

To score your system, run the following:

newsroom-score \
    --dataset dev.dataset \          # Path to evaluation data.
    --summaries textrank.summaries \ # Path to system's output summaries.
    --scores textrank.scores \       # Output path to write summary scores.
    --rouge 1,2,L \                  # ROUGE variants to run.
    --unstemmed                      # Or, --stemmed for Porter stemming.

The script produces a file (textrank.scores) containing pairs of system and reference summaries, article metadata for analysis, and ROUGE scores. Additionally, overall ROUGE scores are printed on completion.
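
The scores file can also be inspected programmatically. The sketch below assumes it shares the dataset's gzip-compressed JSON lines format and simply prints whatever keys are present, since the exact field names are not listed here:

from newsroom import jsonl

# Hedged sketch: iterate over scored entries (file format assumed).
with jsonl.open("textrank.scores", gzip = True) as scores_file:
    for entry in scores_file:
        print(sorted(entry.keys()))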

Producing Output Tables

To produce ROUGE tables across Newsroom compression, density, and coverage subsets, run the following:

newsroom-tables \
    --scores textrank.scores \
    --rouge 1,2,L \
    --variants fscore \
    --bins density,compression,coverage

All command line tools have a --help flag that shows a description of arguments and their defaults.

More Repositories

1. nlvr (HTML, 252 stars): Cornell NLVR and NLVR2 are natural language grounding datasets. Each example shows a visual input and a sentence describing it, and is annotated with the truth-value of the sentence.
2. spf (Java, 128 stars): Cornell Semantic Parsing Framework
3. touchdown (Python, 92 stars): Cornell Touchdown natural language navigation and spatial reasoning dataset.
4. kilogram (Jupyter Notebook, 50 stars): The KiloGram Tangrams dataset
5. atis (Python, 46 stars)
6. chalet (HTML, 46 stars): Cornell House Agent Learning Environment
7. blocks (Python, 40 stars): Blocks World -- Simulator, Code, and Models (Misra et al. EMNLP 2017)
8. ciff (Python, 33 stars): Cornell Instruction Following Framework
9. drif (Python, 31 stars): Dynamic Robot Instruction Following
10. cerealbar (Python, 28 stars): Cereal Bar is a two-player web game designed for studying language understanding agents in collaborative interactions. This repository contains code for the game, a webapp hosting the game, the agent implementation, and recorded interactions in the game. http://lil.nlp.cornell.edu/cerealbar/
11. amr (Java, 23 stars): Cornell AMR Semantic Parser (Artzi et al., EMNLP 2015)
12. bandit-qa (Python, 18 stars): Code for Simulating Bandit Learning from User Feedback for Extractive Question Answering.
13. nccg (Java, 17 stars): Neural Shift Reduce Parser for CCG Semantic Parsing (Misra and Artzi, EMNLP 2016)
14. lm-class (NewLisp, 16 stars): Materials for a language modeling class, broadly construed
15. cb2 (Python, 14 stars): An NLP research and data collection platform.
16. vgnsl_analysis (Python, 12 stars): "What is Learned in Visually Grounded Neural Syntax Acquisition", Noriyuki Kojima, Hadar Averbuch-Elor, Alexander Rush and Yoav Artzi (ACL 2020)
17. navi (Java, 11 stars): Code for Weakly Supervised Learning of Semantic Parsers for Mapping Instructions to Actions (Artzi and Zettlemoyer, TACL 2013)
18. navigation-corpus (Python, 9 stars): Navigation data used for Chen and Mooney 2011 and Artzi and Zettlemoyer 2013 (including cleaned up oracle data)
19. qa-from-hf (Python, 9 stars)
20. dynet_tutorials (Jupyter Notebook, 8 stars): Contains various short notebooks showing how to use DyNet. Created for CS 5740 at Cornell University.
21. scone (Python, 7 stars)
22. lilgym (Python, 7 stars): lilGym RL benchmark
23. recnet (TypeScript, 3 stars): A human-driven recommendation system for academic readings.
24. gsmn (2 stars): Code for RSS2018 paper on the Grounded Semantic Mapping Network
25. cerealbar_generation (Python, 1 star)
26. geoquery-corpus (1 star): The GeoQuery corpus
27. phrase_grounding (Python, 1 star)
28. kilogram-annotation-task (JavaScript, 1 star): Task website for collecting tangram annotations from MTurk.
29. lilgym-baselines (Python, 1 star)
30. remote-teaching-setup (1 star): Remote teaching and talk recording setup