BLUE, the Biomedical Language Understanding Evaluation benchmark
***** New Aug 13th, 2019: Change DDI metric from micro-F1 to macro-F1 *****
***** New July 11th, 2019: preprocessed PubMed texts *****
We uploaded the preprocessed PubMed texts that were used to pre-train the NCBI_BERT models.
***** New June 17th, 2019: data in BERT format *****
We uploaded some datasets that are ready to be used with the NCBI BlueBERT code.
Introduction
The BLUE benchmark consists of five different biomedicine text-mining tasks with ten corpora. We rely on preexisting datasets because they have been widely used by the BioNLP community as shared tasks. These tasks cover a diverse range of text genres (biomedical literature and clinical notes), dataset sizes, and degrees of difficulty and, more importantly, highlight common biomedicine text-mining challenges.
Tasks
Corpus | Train | Dev | Test | Task | Metrics | Domain |
---|---|---|---|---|---|---|
MedSTS | 675 | 75 | 318 | Sentence similarity | Pearson | Clinical |
BIOSSES | 64 | 16 | 20 | Sentence similarity | Pearson | Biomedical |
BC5CDR-disease | 4182 | 4244 | 4424 | NER | F1 | Biomedical |
BC5CDR-chemical | 5203 | 5347 | 5385 | NER | F1 | Biomedical |
ShARe/CLEFE | 4628 | 1075 | 5195 | NER | F1 | Clinical |
DDI | 2937 | 1004 | 979 | Relation extraction | macro F1 | Biomedical |
ChemProt | 4154 | 2416 | 3458 | Relation extraction | micro F1 | Biomedical |
i2b2-2010 | 3110 | 11 | 6293 | Relation extraction | F1 | Clinical |
HoC | 1108 | 157 | 315 | Document classification | F1 | Biomedical |
MedNLI | 11232 | 1395 | 1422 | Inference | accuracy | Clinical |
Sentence similarity
BIOSSES is a corpus of sentence pairs selected from the Biomedical Summarization Track Training Dataset in the biomedical domain. Here, we randomly select 80% for training and 20% for testing because there are no standard splits in the released data.
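The random 80/20 split described for BIOSSES can be sketched as follows. This is a minimal illustration, not the script used to build the benchmark: the function name `split_pairs`, the fixed seed, and the placeholder sentence pairs are all assumptions. The 100 placeholder pairs match the sum of the BIOSSES row in the task table; how the 16 development pairs in that row were set aside is not described here, so the sketch does not attempt it.

```python
import random

def split_pairs(pairs, test_fraction=0.2, seed=0):
    """Shuffle a copy of `pairs` and split it into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    shuffled = list(pairs)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-ins for the sentence pairs and their similarity scores.
pairs = [(f"sentence A{i}", f"sentence B{i}", 0.0) for i in range(100)]
train, test = split_pairs(pairs)
print(len(train), len(test))  # 80 20
```

Because the split is random, scores obtained on a different random split are not directly comparable with the ones reported in this README.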
MedSTS is a corpus of sentence pairs selected from Mayo Clinic's clinical data warehouse. Please visit the website to obtain a copy of the dataset. We use the standard training and testing sets in the shared task.
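Both sentence-similarity corpora are scored with the Pearson correlation between gold and predicted similarity scores, as listed in the task table. A minimal standard-library sketch of that metric is shown below. The function name `pearson` and the example scores are illustrative, and the baseline tables appear to report the coefficient multiplied by 100.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / math.sqrt(var_x * var_y)

# Made-up gold and predicted similarity scores for four sentence pairs.
gold = [4.0, 2.5, 0.5, 3.0]
pred = [3.8, 2.0, 1.0, 3.5]
print(round(100 * pearson(gold, pred), 1))
```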
Named entity recognition
BC5CDR is a collection of 1,500 PubMed titles and abstracts selected from the CTD-Pfizer corpus and was used in the BioCreative V chemical-disease relation task. We use the standard training and test sets in the BC5CDR shared task.
ShARe/CLEF eHealth Task 1 Corpus is a collection of 299 deidentified clinical free-text notes from the MIMIC II database. Please visit the website to obtain a copy of the dataset. We use the standard training and test sets in the ShARe/CLEF eHealth Task 1.
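The NER corpora above are scored with F1. One common way to compute it, sketched below, is at the entity level with exact span matching over BIO tags. Both the BIO encoding and the exact-match criterion are assumptions made for illustration, since this README does not specify them, and the helper names `bio_to_spans` and `f1_score` are hypothetical.

```python
def bio_to_spans(tags):
    """Return the set of (start, end, type) spans encoded by a BIO tag sequence."""
    spans = set()
    start, etype = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        # An "I-" tag with no open span of the same type is ignored (an assumption).
    return spans

def f1_score(gold_spans, pred_spans):
    """Exact-match F1 between two sets of spans."""
    tp = len(gold_spans & pred_spans)
    precision = tp / len(pred_spans) if pred_spans else 0.0
    recall = tp / len(gold_spans) if gold_spans else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["B-Disease", "I-Disease", "O", "B-Disease"]
pred = ["B-Disease", "I-Disease", "O", "O"]
score = f1_score(bio_to_spans(gold), bio_to_spans(pred))
print(round(score, 4))  # 0.6667
```

For a whole corpus, spans would need a sentence or document identifier added to each tuple so that spans from different sentences are not confused.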
Relation extraction
DDI extraction 2013 corpus is a collection of 792 texts selected from the DrugBank database and 233 other Medline abstracts. In our benchmark, we use 624 train files and 191 test files to evaluate the performance and report the macro-average F1-score of the four DDI types.
ChemProt consists of 1,820 PubMed abstracts with chemical-protein interactions and was used in the BioCreative VI text mining chemical-protein interactions shared task. We use the standard training and test sets in the ChemProt shared task and evaluate the same five classes: CPR:3, CPR:4, CPR:5, CPR:6, and CPR:9.
i2b2 2010 shared task collection consists of 170 documents for training and 256 documents for testing, which is a subset of the original dataset. The dataset was collected from three different hospitals and was annotated by medical practitioners for eight types of relations between problems and treatments.
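DDI is reported with macro-averaged F1 and ChemProt with micro-averaged F1, in both cases over the positive relation classes only. The difference between the two averages can be sketched as follows. The function names, the negative label `"false"`, and the toy predictions are assumptions for illustration and do not come from the benchmark's evaluation scripts.

```python
from collections import Counter

def per_class_counts(gold, pred, classes):
    """Count true positives, false positives, and false negatives per evaluated class."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            if g in classes:
                tp[g] += 1
        else:
            if p in classes:
                fp[p] += 1
            if g in classes:
                fn[g] += 1
    return tp, fp, fn

def f1(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def micro_f1(gold, pred, classes):
    """Pool the counts of all evaluated classes, then compute one F1."""
    tp, fp, fn = per_class_counts(gold, pred, classes)
    return f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))

def macro_f1(gold, pred, classes):
    """Compute F1 per evaluated class, then take the unweighted mean."""
    tp, fp, fn = per_class_counts(gold, pred, classes)
    return sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)

classes = ["CPR:3", "CPR:4"]  # evaluated classes; "false" is a hypothetical negative label
gold = ["CPR:3", "CPR:3", "CPR:3", "CPR:4", "false"]
pred = ["CPR:3", "CPR:3", "CPR:3", "false", "CPR:4"]
print(micro_f1(gold, pred, classes))  # 0.75
print(macro_f1(gold, pred, classes))  # 0.5
```

In this toy case the frequent class is predicted perfectly and the rare class is always missed, so the macro average is lower than the micro average: macro-averaging gives every class equal weight regardless of its frequency.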
Document multilabel classification
HoC (the Hallmarks of Cancers corpus) consists of 1,580 PubMed abstracts annotated with ten currently known hallmarks of cancer. We use 315 (~20%) abstracts for testing and the remaining abstracts for training. For the HoC task, we followed the common practice and reported the example-based F1-score on the abstract level.
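An example-based F1-score is computed per abstract from its gold and predicted label sets and then averaged over abstracts. The sketch below shows one common definition. It is an assumption that this matches the exact variant used for the reported numbers, since some implementations instead average precision and recall first. The label names `H1` to `H3` are placeholders rather than real hallmark names.

```python
def example_based_f1(gold_sets, pred_sets):
    """Mean over examples of the F1 between gold and predicted label sets."""
    if len(gold_sets) != len(pred_sets) or not gold_sets:
        raise ValueError("need two non-empty lists of equal length")
    total = 0.0
    for gold, pred in zip(gold_sets, pred_sets):
        if not gold and not pred:
            total += 1.0  # assumption: an abstract with no labels, predicted as such, is a match
            continue
        overlap = len(gold & pred)
        total += 2 * overlap / (len(gold) + len(pred))
    return total / len(gold_sets)

# Three hypothetical abstracts with placeholder hallmark labels.
gold_labels = [{"H1", "H2"}, {"H3"}, set()]
pred_labels = [{"H1"}, {"H3"}, set()]
print(round(example_based_f1(gold_labels, pred_labels), 4))  # 0.8889
```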
Inference task
MedNLI is a collection of sentence pairs selected from MIMIC-III. We use the same training, development, and test sets as Romanov and Shivade.
Datasets
Some datasets can be downloaded at https://github.com/ncbi-nlp/BLUE_Benchmark/releases/tag/0.1
Baselines
Corpus | Metrics | SOTA* | ELMo | BioBERT | NCBI_BERT(base) (P) | NCBI_BERT(base) (P+M) | NCBI_BERT(large) (P) | NCBI_BERT(large) (P+M) |
---|---|---|---|---|---|---|---|---|
MedSTS | Pearson | 83.6 | 68.6 | 84.5 | 84.5 | 84.8 | 84.6 | 83.2 |
BIOSSES | Pearson | 84.8 | 60.2 | 82.7 | 89.3 | 91.6 | 86.3 | 75.1 |
BC5CDR-disease | F | 84.1 | 83.9 | 85.9 | 86.6 | 85.4 | 82.9 | 83.8 |
BC5CDR-chemical | F | 93.3 | 91.5 | 93.0 | 93.5 | 92.4 | 91.7 | 91.1 |
ShARe/CLEFE | F | 70.0 | 75.6 | 72.8 | 75.4 | 77.1 | 72.7 | 74.4 |
DDI | F | 72.9 | 62.0 | 78.8 | 78.1 | 79.4 | 79.9 | 76.3 |
ChemProt | F | 64.1 | 66.6 | 71.3 | 72.5 | 69.2 | 74.4 | 65.1 |
i2b2 2010 | F | 73.7 | 71.2 | 72.2 | 74.4 | 76.4 | 73.3 | 73.9 |
HoC | F | 81.5 | 80.0 | 82.9 | 85.3 | 83.1 | 87.3 | 85.3 |
MedNLI | acc | 73.5 | 71.4 | 80.5 | 82.2 | 84.0 | 81.5 | 83.8 |
P: PubMed, P+M: PubMed + MIMIC-III
SOTA*: state of the art as of April 2019, to the best of our knowledge.
- MedSTS, BIOSSES: Chen et al. 2019. BioSentVec: creating sentence embeddings for biomedical texts. In Proceedings of the 7th IEEE International Conference on Healthcare Informatics.
- BC5CDR-disease, BC5CDR-chemical: Yoon et al. 2018. CollaboNet: collaboration of deep neural networks for biomedical named entity recognition. arXiv preprint arXiv:1809.07950.
- ShARe/CLEFE: Leaman et al. 2015. Challenges in clinical natural language processing for automated disorder normalization. Journal of biomedical informatics, 57:28–37.
- DDI: Zhang et al. 2018. Drug-drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths. Bioinformatics (Oxford, England), 34:828–835.
- ChemProt: Peng et al. 2018. Extracting chemical-protein relations with ensembles of SVM and deep learning models. Database: the journal of biological databases and curation, 2018.
- i2b2 2010: Rink et al. 2011. Automatic extraction of relations between medical concepts in clinical texts. Journal of the American Medical Informatics Association, 18:594–600.
- HoC: Du et al. 2019. ML-Net: multilabel classification of biomedical texts with deep neural networks. Journal of the American Medical Informatics Association (JAMIA).
- MedNLI: Romanov et al. 2018. Lessons from natural language inference in the clinical domain. In Proceedings of EMNLP, pages 1586–1596.
Fine-tuning with ELMo
We adopted the ELMo model pre-trained on PubMed abstracts to accomplish the BLUE tasks. The ELMo embedding of each token is used as input to the fine-tuning model. We retrieved the output states of both layers in ELMo and concatenated them into one vector for each word. We padded sequences to a maximum length of 128. The learning rate was set to 0.001 with an Adam optimizer. We trained for 20 epochs with batch size 64 and stopped early if the training loss did not decrease.
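The settings above can be collected into a small sketch that uses plain Python lists in place of real ELMo outputs. It is not the authors' training code: the names `ElmoFineTuneConfig`, `concat_layers`, `pad_or_truncate`, and `epochs_run` are hypothetical, and zero-vector padding, truncation of longer inputs, and stopping at the first epoch whose training loss fails to decrease (no patience) are assumptions where the description leaves details open.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ElmoFineTuneConfig:
    max_seq_length: int = 128
    learning_rate: float = 0.001  # used with an Adam optimizer
    num_epochs: int = 20
    batch_size: int = 64

def concat_layers(layer1, layer2):
    """Concatenate the per-token vectors of two ELMo layers into one vector per token."""
    return [v1 + v2 for v1, v2 in zip(layer1, layer2)]

def pad_or_truncate(vectors, max_len, dim):
    """Clip to max_len tokens, then pad with zero vectors (assumed padding value)."""
    clipped = vectors[:max_len]
    padding = [[0.0] * dim for _ in range(max_len - len(clipped))]
    return clipped + padding

def epochs_run(train_losses, config):
    """Number of epochs executed before early stopping or the epoch limit."""
    best = float("inf")
    for epoch, loss in enumerate(train_losses[:config.num_epochs], start=1):
        if loss >= best:
            return epoch  # training loss did not decrease: stop after this epoch
        best = loss
    return min(len(train_losses), config.num_epochs)

config = ElmoFineTuneConfig()
tokens = concat_layers([[1.0, 2.0]], [[3.0, 4.0]])  # one token, two toy 2-d layers
padded = pad_or_truncate(tokens, config.max_seq_length, dim=4)
print(len(padded), len(padded[0]))  # 128 4
print(epochs_run([0.9, 0.7, 0.6, 0.65, 0.5], config))  # 4
```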
Fine-tuning with BERT
Please see https://github.com/ncbi-nlp/ncbi_bluebert.
Citing BLUE
- Peng Y, Yan S, Lu Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. In Proceedings of the Workshop on Biomedical Natural Language Processing (BioNLP). 2019.
@InProceedings{peng2019transfer,
author = {Yifan Peng and Shankai Yan and Zhiyong Lu},
title = {Transfer Learning in Biomedical Natural Language Processing:
An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
year = {2019},
}
Acknowledgments
This work was supported by the Intramural Research Programs of the National Institutes of Health, National Library of Medicine and Clinical Center. This work was supported by the National Library of Medicine of the National Institutes of Health under award number K99LM013001-01.
We are also grateful to the authors of BERT and ELMo for making their data and code publicly available. We would like to thank Geeticka Chauhan for providing thoughtful comments.
Disclaimer
This tool shows the results of research conducted in the Computational Biology Branch, NCBI. The information produced on this website is not intended for direct diagnostic use or medical decision-making without review and oversight by a clinical professional. Individuals should not change their health behavior solely on the basis of information produced on this website. NIH does not independently verify the validity or utility of the information produced by this tool. If you have questions about the information produced on this website, please see a health care professional. More information about NCBI's disclaimer policy is available.