BLUE, the Biomedical Language Understanding Evaluation benchmark

***** New Aug 13th, 2019: Change DDI metric from micro-F1 to macro-F1 *****

***** New July 11th, 2019: preprocessed PubMed texts *****

We uploaded the preprocessed PubMed texts that were used to pre-train the NCBI_BERT models.

***** New June 17th, 2019: data in BERT format *****

We uploaded some datasets that are ready to be used with the NCBI BlueBERT code.

Introduction

The BLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora. Here, we rely on preexisting datasets because they have been widely used by the BioNLP community as shared tasks. These tasks cover a diverse range of text genres (biomedical literature and clinical notes), dataset sizes, and degrees of difficulty and, more importantly, highlight common biomedicine text-mining challenges.

Tasks

Corpus           Train  Dev   Test  Task                     Metrics   Domain
MedSTS           675    75    318   Sentence similarity      Pearson   Clinical
BIOSSES          64     16    20    Sentence similarity      Pearson   Biomedical
BC5CDR-disease   4182   4244  4424  NER                      F1        Biomedical
BC5CDR-chemical  5203   5347  5385  NER                      F1        Biomedical
ShARe/CLEFE      4628   1075  5195  NER                      F1        Clinical
DDI              2937   1004  979   Relation extraction      macro F1  Biomedical
ChemProt         4154   2416  3458  Relation extraction      micro F1  Biomedical
i2b2-2010        3110   11    6293  Relation extraction      F1        Clinical
HoC              1108   157   315   Document classification  F1        Biomedical
MedNLI           11232  1395  1422  Inference                accuracy  Clinical

Sentence similarity

BIOSSES is a corpus of sentence pairs selected from the Biomedical Summarization Track Training Dataset in the biomedical domain. Here, we randomly select 80% for training and 20% for testing because there are no standard splits in the released data.
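
A random 80/20 split like the one described for BIOSSES can be sketched as follows. This is only an illustration: the function name `split_pairs`, the fixed seed, and the placeholder items are assumptions, and the text does not say how the actual split was drawn.

```python
import random


def split_pairs(pairs, train_fraction=0.8, seed=0):
    """Shuffle a copy of `pairs` and split it into (train, test).

    The seed is an assumption made here for reproducibility; it is not
    taken from the benchmark.
    """
    rng = random.Random(seed)
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# 100 placeholder items stand in for sentence pairs.
train, test = split_pairs(range(100))
print(len(train), len(test))  # 80 20
```

Fixing the seed keeps the split stable across runs, which matters when results on such a small test set are compared.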

MedSTS is a corpus of sentence pairs selected from Mayo Clinic's clinical data warehouse. Please visit the website to obtain a copy of the dataset. We use the standard training and testing sets in the shared task.
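
Both sentence-similarity corpora are scored with the Pearson correlation between gold and predicted similarity scores. A minimal standard-library sketch of that coefficient is shown below; the function name and the toy scores are made up for illustration and are not taken from either corpus.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy gold and predicted similarity scores (illustrative values only).
gold_scores = [1.0, 2.0, 3.0]
predicted_scores = [2.0, 4.0, 6.0]
print(pearson(gold_scores, predicted_scores))  # 1.0
```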

Named entity recognition

BC5CDR is a collection of 1,500 PubMed titles and abstracts selected from the CTD-Pfizer corpus and was used in the BioCreative V chemical-disease relation task. We use the standard training and test sets in the BC5CDR shared task.

ShARe/CLEF eHealth Task 1 Corpus is a collection of 299 deidentified clinical free-text notes from the MIMIC II database. Please visit the website to obtain a copy of the dataset. We use the standard training and test sets in the ShARe/CLEF eHealth Task 1.

Relation extraction

The DDI extraction 2013 corpus is a collection of 792 texts selected from the DrugBank database and 233 other Medline abstracts. In our benchmark, we use 624 train files and 191 test files to evaluate the performance and report the macro-average F1-score of the four DDI types.
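
Macro-averaging computes F1 separately for each DDI type and then takes the unweighted mean, so rare types count as much as frequent ones. The sketch below illustrates the idea; the label strings (including `"none"` for negative pairs) and the toy predictions are assumptions for illustration, not the benchmark's actual label names or evaluation script.

```python
def f1_for_label(gold, pred, label):
    """F1 of a single class, treating every other label as negative."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-class F1 over the given positive labels."""
    return sum(f1_for_label(gold, pred, label) for label in labels) / len(labels)


# Hypothetical label names for the four DDI types; "none" marks negative pairs.
DDI_TYPES = ["advise", "effect", "mechanism", "int"]
gold = ["effect", "effect", "mechanism", "advise", "int", "none"]
pred = ["effect", "none", "mechanism", "advise", "effect", "none"]
print(macro_f1(gold, pred, DDI_TYPES))  # 0.625
```

In this toy case the missed `int` example pulls the score down by a full quarter, which is the behaviour that distinguishes macro from micro averaging.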

ChemProt consists of 1,820 PubMed abstracts with chemical-protein interactions and was used in the BioCreative VI text mining chemical-protein interactions shared task. We use the standard training and test sets in the ChemProt shared task and evaluate the same five classes: CPR:3, CPR:4, CPR:5, CPR:6, and CPR:9.
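
The Tasks table lists micro F1 as the ChemProt metric. Micro-averaging pools true positives, false positives, and false negatives across the five evaluated classes before computing a single F1. The following is a minimal sketch under that reading; the `"false"` label for pairs outside the five classes and the toy predictions are assumptions, not the official evaluation script.

```python
def micro_f1(gold, pred, labels):
    """Micro-averaged F1 restricted to the given positive labels."""
    positive = set(labels)
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g in positive)
    fp = sum(1 for g, p in zip(gold, pred) if p in positive and g != p)
    fn = sum(1 for g, p in zip(gold, pred) if g in positive and g != p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


CPR_CLASSES = ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9"]
# "false" is a hypothetical label for pairs outside the five classes.
gold_labels = ["CPR:3", "CPR:4", "CPR:4", "CPR:9", "false", "CPR:5"]
pred_labels = ["CPR:3", "CPR:4", "CPR:3", "false", "CPR:6", "CPR:5"]
print(micro_f1(gold_labels, pred_labels, CPR_CLASSES))
```

Note that a confusion between two positive classes counts as both a false positive and a false negative, while predicting `"false"` for a positive pair is only a false negative.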

The i2b2 2010 shared task collection consists of 170 documents for training and 256 documents for testing, which is a subset of the original dataset. The dataset was collected from three different hospitals and was annotated by medical practitioners for eight types of relations between problems and treatments.

Document multilabel classification

HoC (the Hallmarks of Cancers corpus) consists of 1,580 PubMed abstracts annotated with ten currently known hallmarks of cancer. We use 315 (~20%) abstracts for testing and the remaining abstracts for training. For the HoC task, we followed the common practice and reported the example-based F1-score on the abstract level.
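
Example-based F1 scores each abstract by comparing its predicted label set with its gold label set and then averages over abstracts. The sketch below uses one common definition (the mean of per-example F1); the text does not spell out which variant was used, and scoring an abstract with no gold and no predicted labels as 1.0 is an assumption made here. The label names are placeholders.

```python
def example_based_f1(gold_sets, pred_sets):
    """Mean over examples of 2|G ∩ P| / (|G| + |P|)."""
    if len(gold_sets) != len(pred_sets) or not gold_sets:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for gold, pred in zip(gold_sets, pred_sets):
        gold, pred = set(gold), set(pred)
        if not gold and not pred:
            # Assumption: an abstract with no labels, predicted as such, is a perfect match.
            total += 1.0
        else:
            total += 2 * len(gold & pred) / (len(gold) + len(pred))
    return total / len(gold_sets)


# Placeholder hallmark labels for three abstracts.
gold_hallmarks = [{"h1", "h2"}, {"h3"}, set()]
pred_hallmarks = [{"h1"}, {"h3", "h4"}, set()]
print(example_based_f1(gold_hallmarks, pred_hallmarks))
```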

Inference task

MedNLI is a collection of sentence pairs selected from MIMIC-III. We use the same training, development, and test sets as Romanov and Shivade.

Datasets

Some datasets can be downloaded at https://github.com/ncbi-nlp/BLUE_Benchmark/releases/tag/0.1

Baselines

Corpus           Metrics  SOTA*  ELMo  BioBERT  NCBI_BERT(base) (P)  NCBI_BERT(base) (P+M)  NCBI_BERT(large) (P)  NCBI_BERT(large) (P+M)
MedSTS           Pearson  83.6   68.6  84.5     84.5                 84.8                   84.6                  83.2
BIOSSES          Pearson  84.8   60.2  82.7     89.3                 91.6                   86.3                  75.1
BC5CDR-disease   F        84.1   83.9  85.9     86.6                 85.4                   82.9                  83.8
BC5CDR-chemical  F        93.3   91.5  93.0     93.5                 92.4                   91.7                  91.1
ShARe/CLEFE      F        70.0   75.6  72.8     75.4                 77.1                   72.7                  74.4
DDI              F        72.9   62.0  78.8     78.1                 79.4                   79.9                  76.3
ChemProt         F        64.1   66.6  71.3     72.5                 69.2                   74.4                  65.1
i2b2 2010        F        73.7   71.2  72.2     74.4                 76.4                   73.3                  73.9
HoC              F        81.5   80.0  82.9     85.3                 83.1                   87.3                  85.3
MedNLI           acc      73.5   71.4  80.5     82.2                 84.0                   81.5                  83.8

P: PubMed, P+M: PubMed + MIMIC-III

* SOTA: state of the art as of April 2019, to the best of our knowledge

Fine-tuning with ELMo

We adopted the ELMo model pre-trained on PubMed abstracts to accomplish the BLUE tasks. The output of ELMo embeddings of each token is used as input for the fine-tuning model. We retrieved the output states of both layers in ELMo and concatenated them into one vector for each word. We used a maximum sequence length of 128 for padding. The learning rate was set to 0.001 with an Adam optimizer. We iterated the training process for 20 epochs with batch size 64 and stopped early if the training loss did not decrease.
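
Two steps of this setup, building one vector per token from the two layer outputs and stopping early within the 20-epoch budget, can be sketched in plain Python. This is not the authors' code: the function names are hypothetical, padding with zero vectors and truncating sequences longer than 128 are assumptions, and so is stopping at the first epoch whose training loss fails to decrease.

```python
MAX_LEN = 128    # maximum sequence length stated in the text
MAX_EPOCHS = 20  # epoch budget stated in the text


def concat_and_pad(layer1, layer2, max_len=MAX_LEN):
    """Concatenate per-token vectors from two layers and pad to max_len.

    layer1 and layer2 are lists of per-token vectors (lists of floats).
    Zero padding and truncation are assumptions made for this sketch.
    """
    if len(layer1) != len(layer2) or not layer1:
        raise ValueError("layers must be non-empty and cover the same tokens")
    dim = len(layer1[0]) + len(layer2[0])
    tokens = [a + b for a, b in zip(layer1, layer2)][:max_len]
    padding = [[0.0] * dim for _ in range(max_len - len(tokens))]
    return tokens + padding


def epochs_run(train_losses, max_epochs=MAX_EPOCHS):
    """Number of epochs executed before early stopping or the epoch limit.

    Assumption: training stops after the first epoch whose loss does not
    improve on the best loss seen so far.
    """
    best = float("inf")
    for epoch, loss in enumerate(train_losses[:max_epochs], start=1):
        if loss >= best:
            return epoch
        best = loss
    return min(len(train_losses), max_epochs)


# Toy two-dimensional layer outputs for a one-token sentence.
features = concat_and_pad([[1.0, 2.0]], [[3.0, 4.0]], max_len=3)
print(features)  # [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(epochs_run([1.0, 0.8, 0.7, 0.75, 0.6]))  # 4
```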

Fine-tuning with BERT

Please see https://github.com/ncbi-nlp/ncbi_bluebert.

Citing BLUE

@InProceedings{peng2019transfer,
  author    = {Yifan Peng and Shankai Yan and Zhiyong Lu},
  title     = {Transfer Learning in Biomedical Natural Language Processing: 
               An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
  booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
  year      = {2019},
}

Acknowledgments

This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine and Clinical Center. This work was supported by the National Library of Medicine of the National Institutes of Health under award number K99LM013001-01.

We are also grateful to the authors of BERT and ELMo for making their data and code publicly available. We would like to thank Geeticka Chauhan for providing thoughtful comments.

Disclaimer

This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.
