Repository Details

PubMed 200k RCT dataset: a large dataset for sequential sentence classification.

PubMed 200k RCT dataset

The PubMed 200k RCT dataset is described in Franck Dernoncourt, Ji Young Lee. PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts. International Joint Conference on Natural Language Processing (IJCNLP). 2017.

Abstract:

PubMed 200k RCT is a new dataset based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion. The purpose of releasing this dataset is twofold. First, the majority of datasets for sequential short-text classification (i.e., classification of short texts that appear in sequences) are small: we hope that releasing a new large dataset will help develop more accurate algorithms for this task. Second, from an application perspective, researchers need better tools to efficiently skim through the literature. Automatically classifying each sentence in an abstract would help researchers read abstracts more efficiently, especially in fields where abstracts may be long, such as the medical field.

Some miscellaneous information:

  • PubMed 20k is a subset of PubMed 200k, i.e., any abstract present in PubMed 20k is also present in PubMed 200k.
  • PubMed_200k_RCT is the same as PubMed_200k_RCT_numbers_replaced_with_at_sign, except that in the latter all numbers have been replaced by @. The same holds for PubMed_20k_RCT vs. PubMed_20k_RCT_numbers_replaced_with_at_sign.
  • Because GitHub's file size limit is 100 MiB, we had to compress PubMed_200k_RCT\train.7z and PubMed_200k_RCT_numbers_replaced_with_at_sign\train.zip. To uncompress train.7z, you may use 7-Zip on Windows, Keka on Mac OS X, or p7zip on Linux.
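
To illustrate the difference between the plain and the numbers_replaced_with_at_sign variants, here is a minimal Python sketch of one way such a replacement could be done. The exact rule used to build the released files is not stated in this README, so the rule below (every maximal run of ASCII digits becomes a single @) is an assumption, and the function name replace_numbers_with_at_sign is hypothetical.

```python
import re

# Assumption: a "number" is any maximal run of ASCII digits. The rule actually
# used to produce the released *_numbers_replaced_with_at_sign files is not
# documented here and may differ (e.g., in how decimals are handled).
_DIGIT_RUN = re.compile(r"[0-9]+")


def replace_numbers_with_at_sign(sentence: str) -> str:
    """Return `sentence` with every run of digits replaced by a single '@'."""
    return _DIGIT_RUN.sub("@", sentence)


if __name__ == "__main__":
    example = "A total of 125 patients received 7.5 mg/day for 6 weeks."
    print(replace_numbers_with_at_sign(example))
    # prints: A total of @ patients received @.@ mg/day for @ weeks.
```

Under this rule a decimal such as 7.5 becomes @.@ because the decimal point is not a digit. If you need output identical to the released files, compare a few sentences from both variants before relying on this sketch.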

You are most welcome to share with us your analyses or work using this dataset!

Projects using the PubMed 200k RCT dataset
