Repository Details

Data used in Challenges in Data-to-Document Generation (Wiseman, Shieber, Rush; EMNLP 2017). If you use this data, please cite the above paper.

Update (9/3/20): Please consider using the SportSett:Basketball dataset rather than the standard Rotowire dataset described below. Among other things, SportSett:Basketball corrects some dataset contamination issues, where box- and line-scores appear in multiple splits.

Update (1/22/18): Thanks to @janenie for pointing out that some of the line-scores in the data (which report team-level stats) had the team names flipped. Player-level information was not affected. These examples have now been unflipped.

Data

This dataset consists of (human-written) NBA basketball game summaries aligned with their corresponding box- and line-scores. Summaries taken from rotowire.com are referred to as the "rotowire" data, and summaries taken from sbnation.com (and associated team-specific sites) are referred to as the "sbnation" data; we treat these sub-datasets separately, since they are quite different.

To extract the data, run tar -jxvf rotowire.tar.bz2 to form a rotowire/ directory (and similarly for sbnation.tar.bz2).

Rotowire Data

The rotowire data can be found in rotowire/[train|valid|test].json. There are 4853 distinct rotowire summaries, covering NBA games played between 1/1/2014 and 3/29/2017; some games have multiple summaries. The summaries have been randomly split into training, validation, and test sets consisting of 3398, 727, and 728 summaries, respectively.

SBNation Data

The sbnation data can be found in sbnation/[train|valid|test].json. There are 10903 distinct sbnation summaries, covering NBA games played between 11/3/2006 and 3/26/2017; some games have multiple summaries. The summaries have been randomly split into training, validation, and test sets consisting of 7633, 1635, and 1635 summaries, respectively.

Data Format

Each file is UTF-8 encoded JSON and contains a list of JSON objects, one for each aligned summary/data pair. These JSON objects have the following fields:

  • home_name - Name of home team (unicode)
  • home_city - City of home team (unicode)
  • vis_name - Name of visiting team (unicode)
  • vis_city - City of visiting team (unicode)
  • day - Date of game in %MM_%DD_%YY format (unicode)
  • summary - Tokenized summary of game
  • home_line - Home team line-score object; see below
  • vis_line - Visiting team line-score object; see below
  • box_score - Box-score object; see below
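As a minimal sketch of reading these top-level fields, the snippet below parses a small in-memory JSON string rather than a real split file. The record and all of its values are made up for illustration, the helper names `parse_day` and `describe` are hypothetical, and it assumes that `summary` is stored as a list of token strings and that `day` is month, day, and two-digit year joined by underscores.

```python
import json
from datetime import datetime

# Hypothetical one-game file contents; every value here is made up for illustration.
raw = """[
  {"home_name": "Hawks", "home_city": "Atlanta",
   "vis_name": "Celtics", "vis_city": "Boston",
   "day": "01_13_17",
   "summary": ["The", "Hawks", "defeated", "the", "Celtics", "."],
   "home_line": {}, "vis_line": {}, "box_score": {}}
]"""


def parse_day(day):
    """Parse the `day` field, assuming month_day_two-digit-year (e.g. 01_13_17)."""
    return datetime.strptime(day, "%m_%d_%y").date()


def describe(game):
    """Build a one-line description of a game from its top-level fields."""
    home = f"{game['home_city']} {game['home_name']}"
    vis = f"{game['vis_city']} {game['vis_name']}"
    return f"{parse_day(game['day']).isoformat()}: {vis} at {home}"


# For a real split, json.load on a file opened with encoding="utf-8" would replace this.
games = json.loads(raw)
print(describe(games[0]))
print(" ".join(games[0]["summary"]))
```

Parsing `day` into a date object makes it easy to sort or filter games chronologically; if the real files use a different date layout, only `parse_day` needs to change.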

Line-score Objects

Line-score objects have the following fields:

  • TEAM-NAME - Team name (unicode)
  • TEAM-CITY - Team city (unicode)
  • TEAM-AST - Number of team assists (integer as unicode)
  • TEAM-FG3_PCT - Percentage of 3 pointers made by team (integer as unicode)
  • TEAM-FG_PCT - Percentage of field goals made by team (integer as unicode)
  • TEAM-FT_PCT - Percentage of free throws made by team (integer as unicode)
  • TEAM-LOSSES - Team losses (integer as unicode)
  • TEAM-PTS - Total team points (integer as unicode)
  • TEAM-PTS_QTR1 - Team points in first quarter (integer as unicode)
  • TEAM-PTS_QTR2 - Team points in second quarter (integer as unicode)
  • TEAM-PTS_QTR3 - Team points in third quarter (integer as unicode)
  • TEAM-PTS_QTR4 - Team points in fourth quarter (integer as unicode)
  • TEAM-REB - Total team rebounds (integer as unicode)
  • TEAM-TOV - Total team turnovers (integer as unicode)
  • TEAM-WINS - Team wins (integer as unicode)
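Because the numeric line-score fields are stored as strings, they usually need converting before any arithmetic. The following is a rough sketch of one way to do that; `parse_line_score` is a hypothetical helper, and the sample values are invented rather than taken from the data.

```python
# Fields that hold text rather than numbers.
TEXT_FIELDS = {"TEAM-NAME", "TEAM-CITY"}


def parse_line_score(line):
    """Return a copy of a line-score object with numeric fields converted to int."""
    return {
        key: (value if key in TEXT_FIELDS else int(value))
        for key, value in line.items()
    }


# Hypothetical line-score object with made-up values.
home_line = {
    "TEAM-NAME": "Hawks",
    "TEAM-CITY": "Atlanta",
    "TEAM-PTS": "103",
    "TEAM-PTS_QTR1": "25",
    "TEAM-PTS_QTR2": "28",
    "TEAM-PTS_QTR3": "22",
    "TEAM-PTS_QTR4": "28",
    "TEAM-FG_PCT": "45",
}

parsed = parse_line_score(home_line)
quarter_total = sum(parsed[f"TEAM-PTS_QTR{q}"] for q in range(1, 5))
print(parsed["TEAM-PTS"], quarter_total)
```

In this invented example the four quarters add up to the total; since only four quarters are listed, that would presumably not hold for a game that went to overtime.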

Box-score Objects

Box-score objects contain (column) objects mapping row numbers to values. Rows are numbered from 0 to at most 25, and each row corresponds to a player in the game. In particular, a box-score object contains the following column objects:

  • AST - Player assists (row_number -> integer as unicode)
  • BLK - Player blocks (row_number -> integer as unicode)
  • DREB - Player defensive rebounds (row_number -> integer as unicode)
  • FG3A - Player 3-pointers attempted (row_number -> integer as unicode)
  • FG3M - Player 3-pointers made (row_number -> integer as unicode)
  • FG3_PCT - Player 3-pointer percentage (row_number -> integer as unicode)
  • FGA - Player field goals attempted (row_number -> integer as unicode)
  • FGM - Player field goals made (row_number -> integer as unicode)
  • FG_PCT - Player field goal percentage (row_number -> integer as unicode)
  • FIRST_NAME - Player first name (row_number -> unicode)
  • FTA - Player free throws attempted (row_number -> integer as unicode)
  • FTM - Player free throws made (row_number -> integer as unicode)
  • FT_PCT - Player free throw percentage (row_number -> integer as unicode)
  • MIN - Player minutes played (row_number -> integer as unicode)
  • OREB - Player offensive rebounds (row_number -> integer as unicode)
  • PF - Player personal fouls (row_number -> integer as unicode)
  • PLAYER_NAME - Player full name (row_number -> unicode)
  • PTS - Player points (row_number -> integer as unicode)
  • REB - Player total rebounds (row_number -> integer as unicode)
  • SECOND_NAME - Player second name (row_number -> unicode)
  • START_POSITION - Player position (row_number -> unicode)
  • STL - Player steals (row_number -> integer as unicode)
  • TEAM_CITY - Player team city (row_number -> unicode)
  • TO - Player turnovers (row_number -> integer as unicode)
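Since the box-score is organized by column, it is often more convenient to pivot it into one record per player. The sketch below shows one possible way; `box_score_rows` is a hypothetical helper, and the players and numbers are invented. Row numbers appear as strings because JSON object keys are always strings, so they are sorted numerically rather than alphabetically.

```python
def box_score_rows(box_score):
    """Pivot a column-oriented box-score into a list of per-player dicts."""
    # Collect every row number seen in any column, ordered numerically.
    row_ids = sorted(
        {row for column in box_score.values() for row in column}, key=int
    )
    # A column that lacks a given row yields None for that player.
    return [
        {name: column.get(row) for name, column in box_score.items()}
        for row in row_ids
    ]


# Hypothetical box-score with three players and three columns (made-up values).
box_score = {
    "PLAYER_NAME": {"0": "Player A", "2": "Player B", "10": "Player C"},
    "TEAM_CITY": {"0": "Atlanta", "2": "Atlanta", "10": "Boston"},
    "PTS": {"0": "24", "2": "8", "10": "17"},
}

rows = box_score_rows(box_score)
top_scorer = max(rows, key=lambda row: int(row["PTS"]))
print(top_scorer["PLAYER_NAME"], top_scorer["PTS"])
```

The sketch leaves values as strings; a real pipeline would likely also convert the numeric columns and decide how to treat any entries that are not valid integers.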

Preprocessing Details

Box- and Line-scores

All numeric values in the box- and line-scores have been converted to integers, by rounding if necessary (so percentages are given as integers between 0 and 100).
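To illustrate what an integer percentage looks like, here is a small sketch that derives one from made and attempted counts. It is not the original preprocessing code: `pct_as_int` is a hypothetical helper, returning 0 for zero attempts is an assumption, and Python's `round` resolves exact ties to the nearest even integer, which may differ from the rule used when the dataset was built.

```python
def pct_as_int(made, attempted):
    """Return a shooting percentage as an integer between 0 and 100."""
    if attempted == 0:
        return 0  # assumption: no attempts is reported as 0 percent
    # round() with one argument returns an int; exact .5 ties go to the even value.
    return round(100 * made / attempted)


print(pct_as_int(7, 15))   # 46.67 percent rounds up
print(pct_as_int(5, 10))
print(pct_as_int(0, 0))
```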

Summaries

Summaries are tokenized using nltk, and hyphenated phrases are separated. Tweets and photos were removed from the sbnation summaries, as were any paragraphs that did not contain at least 2 numbers (in either numeric or verbal form).
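The paragraph filter described above could look roughly like the following. This is only a sketch under stated assumptions: the README does not say how verbal numbers were detected, so the `NUMBER_WORDS` list, the regular-expression tokenizer (used here in place of nltk), and the helper names are all hypothetical.

```python
import re

# Hypothetical, deliberately small list of verbal number forms.
NUMBER_WORDS = {
    "one", "two", "three", "four", "five", "six", "seven", "eight", "nine",
    "ten", "eleven", "twelve", "twenty", "thirty", "forty", "fifty",
}


def count_numbers(paragraph):
    """Count tokens that are digit strings or known number words."""
    tokens = re.findall(r"[a-z]+|\d+", paragraph.lower())
    return sum(1 for tok in tokens if tok.isdigit() or tok in NUMBER_WORDS)


def keep_paragraph(paragraph, min_numbers=2):
    """Keep a paragraph only if it mentions at least `min_numbers` numbers."""
    return count_numbers(paragraph) >= min_numbers


print(keep_paragraph("He scored 24 points and grabbed ten rebounds."))
print(keep_paragraph("What a night for the home crowd."))
```

The first call finds two numbers (24 and ten) and keeps the paragraph; the second finds none and drops it.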
