
BioBERT

Bioinformatics'2020: BioBERT: a pre-trained biomedical language representation model for biomedical text mining

This repository provides the code for fine-tuning BioBERT, a biomedical language representation model designed for biomedical text mining tasks such as biomedical named entity recognition, relation extraction, and question answering. Please refer to our paper, BioBERT: a pre-trained biomedical language representation model for biomedical text mining, for more details. This project was developed by DMIS-Lab.

Download

We provide five versions of pre-trained weights. Pre-training was based on the original BERT code provided by Google, and training details are described in our paper. Currently available versions of pre-trained weights are as follows (SHA1SUM):

Note that the performance of the v1.0 and v1.1 base models (BioBERT-Base v1.0, BioBERT-Base v1.1) is reported in the paper. Alternatively, you can download the pre-trained weights from here.

Installation

The sections below describe the installation and fine-tuning of BioBERT based on TensorFlow 1 (Python <= 3.7). For the PyTorch version of BioBERT, check out this repository. If you are not familiar with coding and simply want to recognize biomedical entities in your text using BioBERT, please use this tool, which uses BioBERT for multi-type NER and normalization.

To fine-tune BioBERT, you need the pre-trained weights of BioBERT. After downloading them, install the required packages using requirements.txt as follows:

$ git clone https://github.com/dmis-lab/biobert.git
$ cd biobert; pip install -r requirements.txt
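
If your default Python is newer than 3.7, you may want to run the commands above inside an isolated environment first. A minimal sketch, assuming conda is available (any virtualenv tool with a Python 3.7 interpreter would work equally well):

$ conda create -n biobert python=3.7
$ conda activate biobert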

Note that this repository is based on the BERT repository by Google. All fine-tuning experiments were conducted on a single TITAN Xp GPU with 12 GB of memory. You may want to install Java to use the official BioASQ evaluation script. See requirements.txt for other details.

Quick Links

Link             Detail
BioBERT-PyTorch  PyTorch-based BioBERT implementation
BERN             Web-based biomedical NER + normalization using BioBERT
BERN2            Advanced version of BERN (web-based biomedical NER) with NER from BioLM and NEN from PubMedBERT
covidAsk         BioBERT-based real-time question answering model for COVID-19
7th BioASQ       Code for the 7th BioASQ challenge winning model (factoid/yesno/list)
Paper            Paper link with BibTeX (Bioinformatics)

Datasets

We provide pre-processed versions of the benchmark datasets for each task.

You can simply run download.sh to download all the datasets at once.

$ ./download.sh

This will download the datasets under the folder datasets. Due to copyright issues with some datasets, we provide links to them instead: 2010 i2b2/VA, ChemProt.

Fine-tuning BioBERT

After downloading one of the pre-trained weights, unpack it to any directory; we will denote this directory as $BIOBERT_DIR. For instance, when using BioBERT-Base v1.1 (+ PubMed 1M), set the BIOBERT_DIR environment variable as follows:

$ export BIOBERT_DIR=./biobert_v1.1_pubmed
$ echo $BIOBERT_DIR
>>> ./biobert_v1.1_pubmed

Named Entity Recognition (NER)

Let $NER_DIR indicate a folder for a single NER dataset that contains train_dev.tsv, train.tsv, devel.tsv, and test.tsv. Also, set $OUTPUT_DIR as a directory for NER outputs (trained models, test predictions, etc.). For example, when fine-tuning on the NCBI disease corpus:

$ export NER_DIR=./datasets/NER/NCBI-disease
$ export OUTPUT_DIR=./ner_outputs
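
For reference, each of these .tsv files follows a CoNLL-style layout: one token and its BIO tag per line, separated by a tab, with a blank line between sentences. A minimal, illustrative sketch (made-up snippet; we believe the pre-processed corpora use plain B/I/O tags without entity-type suffixes, but inspect your downloaded files to confirm):

Identification	O
of	O
APC2	O
,	O
a	O
homologue	O
of	O
the	O
adenomatous	B
polyposis	I
coli	I
tumour	I
suppressor	O
.	O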

The following command runs the NER fine-tuning code with default arguments.

$ mkdir -p $OUTPUT_DIR
$ python run_ner.py \
    --do_train=true --do_eval=true \
    --vocab_file=$BIOBERT_DIR/vocab.txt \
    --bert_config_file=$BIOBERT_DIR/bert_config.json \
    --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 \
    --num_train_epochs=10.0 \
    --data_dir=$NER_DIR \
    --output_dir=$OUTPUT_DIR

You can change the arguments as you want. Once you have trained your model, you can run it in inference mode with --do_train=false --do_predict=true to evaluate test.tsv. The token-level evaluation result will be printed to stdout. For example, the result for the NCBI-disease dataset will look like this:

INFO:tensorflow:***** token-level evaluation results *****
INFO:tensorflow:  eval_f = 0.8972311
INFO:tensorflow:  eval_precision = 0.88150835
INFO:tensorflow:  eval_recall = 0.9136615
INFO:tensorflow:  global_step = 2571
INFO:tensorflow:  loss = 28.247158

(Tip: you may need to scroll up a few lines to find the result; it appears just before INFO:tensorflow:**** Trainable Variables ****.)
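
If the result scrolls past, one option is to duplicate stdout/stderr into a log file and search it afterwards; a minimal sketch (same run_ner.py arguments as above, abbreviated here as ...):

$ python run_ner.py ... 2>&1 | tee $OUTPUT_DIR/ner_eval.log
$ grep -A 5 "token-level evaluation results" $OUTPUT_DIR/ner_eval.log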

Note that this result is the token-level evaluation measure, while the official evaluation uses the entity-level evaluation measure. The results of python run_ner.py are written to two files, token_test.txt and label_test.txt, in $OUTPUT_DIR. Use ./biocodes/ner_detokenize.py to obtain a word-level prediction file.

$ python biocodes/ner_detokenize.py \
    --token_test_path=$OUTPUT_DIR/token_test.txt \
    --label_test_path=$OUTPUT_DIR/label_test.txt \
    --answer_path=$NER_DIR/test.tsv \
    --output_dir=$OUTPUT_DIR

This will generate NER_result_conll.txt in $OUTPUT_DIR. Use ./biocodes/conlleval.pl for entity-level exact match evaluation results.

$ perl biocodes/conlleval.pl < $OUTPUT_DIR/NER_result_conll.txt

The entity-level results for the NCBI disease corpus will look like this:

processed 24497 tokens with 960 phrases; found: 983 phrases; correct: 852.
accuracy:  98.49%; precision:  86.67%; recall:  88.75%; FB1:  87.70
             MISC: precision:  86.67%; recall:  88.75%; FB1:  87.70  983

Note that this is a sample run of an NER model. The performance of NER models usually converges after more than 50 epochs (a learning rate of 1e-5 is recommended); see the sketch below.
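
A longer training run following that recommendation might look like the following sketch; it assumes run_ner.py exposes the standard BERT --learning_rate flag alongside the arguments used earlier:

$ python run_ner.py \
    --do_train=true --do_eval=true \
    --vocab_file=$BIOBERT_DIR/vocab.txt \
    --bert_config_file=$BIOBERT_DIR/bert_config.json \
    --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 \
    --learning_rate=1e-5 \
    --num_train_epochs=50.0 \
    --data_dir=$NER_DIR \
    --output_dir=$OUTPUT_DIR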

Relation Extraction (RE)

Let $RE_DIR indicate a folder for a single RE dataset, $TASK_NAME denote the name of the task (two possible options: {gad, euadr}), and $OUTPUT_DIR denote a directory for RE outputs:

$ export RE_DIR=./datasets/RE/GAD/1
$ export TASK_NAME=gad
$ export OUTPUT_DIR=./re_outputs_1

The following command runs the RE fine-tuning code with default arguments.

$ python run_re.py \
    --task_name=$TASK_NAME \
    --do_train=true --do_eval=true --do_predict=true \
    --vocab_file=$BIOBERT_DIR/vocab.txt \
    --bert_config_file=$BIOBERT_DIR/bert_config.json \
    --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 \
    --max_seq_length=128 \
    --train_batch_size=32 \
    --learning_rate=2e-5 \
    --num_train_epochs=3.0 \
    --do_lower_case=false \
    --data_dir=$RE_DIR \
    --output_dir=$OUTPUT_DIR

The predictions will be saved to a file called test_results.tsv in $OUTPUT_DIR. Use ./biocodes/re_eval.py for the evaluation. Note that CHEMPROT is a multi-class classification dataset; to evaluate CHEMPROT results, you should run re_eval.py with the additional --task=chemprot flag, as shown after the GAD example below.

$ python ./biocodes/re_eval.py --output_path=$OUTPUT_DIR/test_results.tsv --answer_path=$RE_DIR/test.tsv

The result for the GAD dataset will look like this:

f1 score    : 83.74%
recall      : 90.75%
precision   : 77.74%
specificity : 71.15%
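
For CHEMPROT, the same evaluation with the --task=chemprot flag mentioned above would look like the following sketch (assuming $RE_DIR and $OUTPUT_DIR point to your ChemProt data and outputs):

$ python ./biocodes/re_eval.py --task=chemprot \
    --output_path=$OUTPUT_DIR/test_results.tsv \
    --answer_path=$RE_DIR/test.tsv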

Please be aware that you have to change $OUTPUT_DIR to train/test a new model. For instance, as most RE datasets use 10-fold splits, you have to create a different output directory to train/test a model for each fold (e.g., $ export OUTPUT_DIR=./re_outputs_2); a loop over all folds is sketched below.
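
To train and evaluate all ten folds in one go, a minimal shell loop such as the following may help (a sketch, assuming the GAD folds live under ./datasets/RE/GAD/1 through ./datasets/RE/GAD/10):

$ for i in $(seq 1 10); do
    export RE_DIR=./datasets/RE/GAD/$i
    export OUTPUT_DIR=./re_outputs_$i
    python run_re.py --task_name=gad --do_train=true --do_eval=true --do_predict=true \
      --vocab_file=$BIOBERT_DIR/vocab.txt --bert_config_file=$BIOBERT_DIR/bert_config.json \
      --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 --max_seq_length=128 --train_batch_size=32 \
      --learning_rate=2e-5 --num_train_epochs=3.0 --do_lower_case=false \
      --data_dir=$RE_DIR --output_dir=$OUTPUT_DIR
  done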

Question Answering (QA)

To use the BioASQ dataset, you need to register on the BioASQ website, which authorizes the use of the dataset. Please unpack the pre-processed BioASQ dataset provided above into a directory $QA_DIR. For example, with $OUTPUT_DIR for QA outputs, set:

$ export QA_DIR=./datasets/QA/BioASQ
$ export OUTPUT_DIR=./qa_outputs

Files named BioASQ-*.json, which are in the pre-processed format for BioBERT, are used for training and testing the model. Note that we pre-trained our model on the SQuAD dataset to obtain state-of-the-art performance (see here for BioBERT pre-trained on SQuAD), and you may have to change $BIOBERT_DIR accordingly. The following command runs the QA fine-tuning code with default arguments.

$ python run_qa.py \
    --do_train=True --do_predict=True \
    --vocab_file=$BIOBERT_DIR/vocab.txt \
    --bert_config_file=$BIOBERT_DIR/bert_config.json \
    --init_checkpoint=$BIOBERT_DIR/model.ckpt-1000000 \
    --max_seq_length=384 \
    --train_batch_size=12 \
    --learning_rate=5e-6 \
    --doc_stride=128 \
    --num_train_epochs=5.0 \
    --do_lower_case=False \
    --train_file=$QA_DIR/BioASQ-train-factoid-4b.json \
    --predict_file=$QA_DIR/BioASQ-test-factoid-4b-1.json \
    --output_dir=$OUTPUT_DIR

The predictions will be saved to the files predictions.json and nbest_predictions.json in $OUTPUT_DIR. Run ./biocodes/transform_nbset2bioasqform.py to convert nbest_predictions.json into the BioASQ JSON format, which is used for the official evaluation.

$ python ./biocodes/transform_nbset2bioasqform.py \
    --nbest_path=$OUTPUT_DIR/nbest_predictions.json \
    --output_path=$OUTPUT_DIR

This will generate BioASQform_BioASQ-answer.json in $OUTPUT_DIR. Clone the evaluation code from the BioASQ GitHub repository and run the evaluation from the Evaluation-Measures directory. Please note that you should always pass 5 as the parameter for -e.

$ git clone https://github.com/BioASQ/Evaluation-Measures.git
$ cd Evaluation-Measures
$ java -Xmx10G -cp $CLASSPATH:./flat/BioASQEvaluation/dist/BioASQEvaluation.jar \
    evaluation.EvaluatorTask1b -phaseB -e 5 \
    ../$QA_DIR/4B1_golden.json \
    ../$OUTPUT_DIR/BioASQform_BioASQ-answer.json

As our model covers only factoid questions, the result will look like this:

0.0 0.3076923076923077 0.5384615384615384 0.394017094017094 0.0 0.0 0.0 0.0 0.0 0.0

where the second, third, and fourth numbers are the strict accuracy (SAcc), lenient accuracy (LAcc), and MRR of factoid questions, respectively. For list and yes/no type questions, please refer to our repository for BioBERT at the 7th BioASQ Challenge.
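
To pull out just those three factoid measures, you can pipe the evaluator's output through a small awk one-liner (field positions as described above):

$ java -Xmx10G -cp $CLASSPATH:./flat/BioASQEvaluation/dist/BioASQEvaluation.jar \
    evaluation.EvaluatorTask1b -phaseB -e 5 \
    ../$QA_DIR/4B1_golden.json ../$OUTPUT_DIR/BioASQform_BioASQ-answer.json \
  | awk '{print "SAcc = " $2, "LAcc = " $3, "MRR = " $4}'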

License and Disclaimer

Please see the LICENSE file for details. Downloading data indicates your acceptance of our disclaimer.

Citation

@article{lee2020biobert,
  title={BioBERT: a pre-trained biomedical language representation model for biomedical text mining},
  author={Lee, Jinhyuk and Yoon, Wonjin and Kim, Sungdong and Kim, Donghyeon and Kim, Sunkyu and So, Chan Ho and Kang, Jaewoo},
  journal={Bioinformatics},
  volume={36},
  number={4},
  pages={1234--1240},
  year={2020},
  publisher={Oxford University Press}
}

Contact Information

For help or issues using BioBERT, please submit a GitHub issue. Please contact Jinhyuk Lee (lee.jnhk (at) gmail.com), or Wonjin Yoon (wonjin.info (at) gmail.com) for communication related to BioBERT.
