  • Language
    Python
  • License
    MIT License
Repository Details

ACL'2020: Biomedical Entity Representations with Synonym Marginalization

BioSyn GitHub

Biomedical Entity Representations with Synonym Marginalization

BioSyn Overview

We present BioSyn for learning biomedical entity representations. You can train BioSyn with the two main components described in our paper: 1) synonym marginalization and 2) iterative candidate retrieval. Once BioSyn is trained, you can easily normalize any biomedical mention or represent it as an entity embedding.
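As a rough illustration of synonym marginalization, the sketch below computes a per-mention loss under the assumption that the model applies a softmax over the scores of the retrieved candidates and maximizes the total probability assigned to candidates that are synonyms of the mention. The function name `marginal_nll` and the clamp value are hypothetical and not taken from the repository.

```python
import math

def marginal_nll(scores, labels):
    """Negative log of the probability mass assigned to synonym candidates.

    scores: similarity scores for the candidates retrieved for one mention.
    labels: 1 if the candidate is a synonym of the mention, else 0.
    This is an illustrative sketch, not the repository's training code.
    """
    # Softmax over the candidate scores (shifted by the max for stability).
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Marginalize: add up the probabilities of all synonym candidates.
    synonym_mass = sum(p for p, y in zip(probs, labels) if y == 1)

    # Clamp (assumed value) so the log is defined when no synonym was retrieved.
    return -math.log(max(synonym_mass, 1e-9))
```

With two equally scored candidates of which one is a synonym, the loss is log 2; when every candidate is a synonym, it is zero.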

Updates

  • [Mar 17, 2022] Checkpoints of BioSyn for normalizing the gene type have been released. The BC2GN data used for the gene type was pre-processed by Tutubalina et al., 2020.
  • [Oct 25, 2021] Trained models have been uploaded to the Hugging Face Hub (please check them out here). In addition to BioBERT, we also trained our model with another pre-trained model, SapBERT, and obtained better performance than reported in our paper.

Requirements

$ conda create -n BioSyn python=3.7
$ conda activate BioSyn
$ conda install numpy tqdm scikit-learn
$ conda install pytorch=1.8.0 cudatoolkit=10.2 -c pytorch
$ pip install transformers==4.11.3

Note that the PyTorch build you install must match your CUDA version.

Datasets

Datasets consist of queries (train, dev, test, and traindev) and dictionaries (train_dictionary, dev_dictionary, and test_dictionary). The only difference between the dictionaries is coverage: dev_dictionary additionally includes train mentions, and test_dictionary additionally includes train and dev mentions. The queries are pre-processed by lowercasing, removing punctuation, resolving composite mentions, and resolving abbreviations (Ab3P). The dictionaries are pre-processed by lowercasing and removing punctuation. If you need the pre-processing code, please let us know by opening an issue.
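The two pre-processing steps shared by queries and dictionaries (lowercasing and punctuation removal) can be approximated as follows. This is only a sketch: the helper name `normalize_name` is hypothetical, and replacing ASCII punctuation with a space before collapsing whitespace is an assumption about how punctuation is handled. Composite-mention and abbreviation resolution are not covered.

```python
import string

# Map every ASCII punctuation character to a space. This is an assumption:
# the original scripts may delete punctuation or treat Unicode differently.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalize_name(text):
    """Lowercase a mention or dictionary name and strip punctuation."""
    lowered = text.lower()
    no_punct = lowered.translate(_PUNCT_TO_SPACE)
    # Collapse the runs of whitespace left behind by removed punctuation.
    return " ".join(no_punct.split())
```

For example, `normalize_name("Ataxia-Telangiectasia, Variant")` yields `ataxia telangiectasia variant` under these assumptions.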

Note that we use the development (dev) set to search for hyperparameters, and train on the traindev (train+dev) set to report the final performance.

The TAC2017ADR dataset cannot be shared because of licensing restrictions. Please visit the website or see here for pre-processing scripts.

Train

The following example fine-tunes our model on the NCBI-Disease dataset (train+dev) with BioBERT v1.1.

MODEL_NAME_OR_PATH=dmis-lab/biobert-base-cased-v1.1
OUTPUT_DIR=./tmp/biosyn-biobert-ncbi-disease
DATA_DIR=./datasets/ncbi-disease

CUDA_VISIBLE_DEVICES=1 python train.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --train_dictionary_path ${DATA_DIR}/train_dictionary.txt \
    --train_dir ${DATA_DIR}/processed_traindev \
    --output_dir ${OUTPUT_DIR} \
    --use_cuda \
    --topk 20 \
    --epoch 10 \
    --train_batch_size 16 \
    --initial_sparse_weight 0 \
    --learning_rate 1e-5 \
    --max_length 25 \
    --dense_ratio 0.5

Note that you can train the model on processed_train and evaluate it on processed_dev when searching for hyperparameters (the --save_checkpoint_all argument can be helpful).
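To give an intuition for how --topk and --dense_ratio might interact during iterative candidate retrieval, the sketch below builds a candidate list from two ranked lists of dictionary indices. It assumes that a fraction `dense_ratio` of the `topk` slots is reserved for dense candidates and the rest come from the sparse ranking, with duplicates skipped. The function name `mix_candidates` and the exact fill order are hypothetical and may differ from the repository's implementation.

```python
def mix_candidates(sparse_ranked, dense_ranked, topk=20, dense_ratio=0.5):
    """Combine sparse and dense rankings into one list of at most topk items.

    sparse_ranked / dense_ranked: dictionary indices sorted best-first.
    Illustrative sketch only; the fill order is an assumption.
    """
    n_dense = int(topk * dense_ratio)
    n_sparse = topk - n_dense

    # Start with the best sparse candidates.
    candidates = list(sparse_ranked[:n_sparse])
    seen = set(candidates)

    # Fill the remaining slots with dense candidates not already chosen.
    for idx in dense_ranked:
        if len(candidates) >= topk:
            break
        if idx not in seen:
            candidates.append(idx)
            seen.add(idx)
    return candidates
```

With `topk=20` and `dense_ratio=0.5`, as in the command above, this would yield 10 sparse and up to 10 dense candidates per mention.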

Evaluation

The following example evaluates our trained model on the NCBI-Disease dataset (test).

MODEL_NAME_OR_PATH=./tmp/biosyn-biobert-ncbi-disease
OUTPUT_DIR=./tmp/biosyn-biobert-ncbi-disease
DATA_DIR=./datasets/ncbi-disease

python eval.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --dictionary_path ${DATA_DIR}/test_dictionary.txt \
    --data_dir ${DATA_DIR}/processed_test \
    --output_dir ${OUTPUT_DIR} \
    --use_cuda \
    --topk 20 \
    --max_length 25 \
    --save_predictions \
    --score_mode hybrid

Result

The predictions are saved in predictions_eval.json with mentions, candidates, and accuracies (the --save_predictions argument must be set). The following is an example.

{
  "queries": [
    {
      "mentions": [
        {
          "mention": "ataxia telangiectasia",
          "golden_cui": "D001260",
          "candidates": [
            {
              "name": "ataxia telangiectasia",
              "cui": "D001260|208900",
              "label": 1
            },
            {
              "name": "ataxia telangiectasia syndrome",
              "cui": "D001260|208900",
              "label": 1
            },
            {
              "name": "ataxia telangiectasia variant",
              "cui": "C566865",
              "label": 0
            },
            {
              "name": "syndrome ataxia telangiectasia",
              "cui": "D001260|208900",
              "label": 1
            },
            {
              "name": "telangiectasia",
              "cui": "D013684",
              "label": 0
            }]
        }]
    },
    ...
  ],
  "acc1": 0.9114583333333334,
  "acc5": 0.9385416666666667
}
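The accuracy fields can be recomputed from the `label` values in the file. The sketch below assumes that a query counts as correct at rank k only when every one of its mentions has at least one candidate with `label` equal to 1 among its first k candidates. The function name `topk_accuracy` is hypothetical, and the scoring rule is an assumption about how composite queries are handled.

```python
def topk_accuracy(queries, k):
    """Fraction of queries whose mentions all have a correct candidate in the top k.

    queries: the list stored under the "queries" key of the predictions file.
    Illustrative sketch; the all-mentions-must-hit rule is an assumption.
    """
    if not queries:
        return 0.0
    hits = 0
    for query in queries:
        mentions = query["mentions"]
        # A mention is a hit if any of its first k candidates is labeled 1.
        if all(
            any(c["label"] == 1 for c in mention["candidates"][:k])
            for mention in mentions
        ):
            hits += 1
    return hits / len(queries)
```

After loading the file with `json.load`, calling `topk_accuracy(data["queries"], 1)` should, under these assumptions, reproduce the stored `acc1` value.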

Inference

We provide a simple script that can normalize a biomedical mention or represent the mention as an embedding vector with BioSyn.

Trained models

NCBI-Disease

Model Acc@1/Acc@5
biosyn-biobert-ncbi-disease 91.1/93.9
biosyn-sapbert-ncbi-disease 92.4/95.8

BC5CDR-Disease

Model Acc@1/Acc@5
biosyn-biobert-bc5cdr-disease 93.2/96.0
biosyn-sapbert-bc5cdr-disease 93.5/96.4

BC5CDR-Chemical

Model Acc@1/Acc@5
biosyn-biobert-bc5cdr-chemical 96.6/97.2
biosyn-sapbert-bc5cdr-chemical 96.6/98.3

BC2GN-Gene

Model Acc@1/Acc@5
biosyn-biobert-bc2gn 90.6/95.6
biosyn-sapbert-bc2gn 91.3/96.3

Predictions (Top 5)

The example below gives the top 5 predictions for the mention "ataxia telangiectasia". Note that the initial run will take some time to embed the whole dictionary. You can download the dictionary file here.

MODEL_NAME_OR_PATH=dmis-lab/biosyn-biobert-ncbi-disease
DATA_DIR=./datasets/ncbi-disease

python inference.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --dictionary_path ${DATA_DIR}/test_dictionary.txt \
    --use_cuda \
    --mention "ataxia telangiectasia" \
    --show_predictions

Result

{
  "mention": "ataxia telangiectasia", 
  "predictions": [
    {"name": "ataxia telangiectasia", "id": "D001260|208900"},
    {"name": "ataxia telangiectasia syndrome", "id": "D001260|208900"}, 
    {"name": "telangiectasia", "id": "D013684"}, 
    {"name": "telangiectasias", "id": "D013684"}, 
    {"name": "ataxia telangiectasia variant", "id": "C566865"}
  ]
}
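If the result above is captured as a JSON string, the top prediction can be extracted as in the sketch below. It assumes that the output is valid JSON and that a pipe character in an `id` separates multiple identifiers for the same concept, as the value `D001260|208900` suggests. The helper name `top_prediction_ids` is hypothetical.

```python
import json

def top_prediction_ids(result_json):
    """Return (name, [ids]) for the highest-ranked prediction.

    result_json: a JSON string shaped like the result shown above.
    Splitting ids on "|" is an assumption based on the example output.
    """
    result = json.loads(result_json)
    predictions = result["predictions"]
    if not predictions:
        return None, []
    best = predictions[0]
    return best["name"], best["id"].split("|")
```

For the example above, this would return the name `ataxia telangiectasia` with the identifiers `D001260` and `208900`.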

Embeddings

The example below gives an embedding of the mention "ataxia telangiectasia".

MODEL_NAME_OR_PATH=dmis-lab/biosyn-biobert-ncbi-disease
DATA_DIR=./datasets/ncbi-disease

python inference.py \
    --model_name_or_path ${MODEL_NAME_OR_PATH} \
    --use_cuda \
    --mention "ataxia telangiectasia" \
    --show_embeddings

Result

{
  "mention": "ataxia telangiectasia", 
  "mention_sparse_embeds": array([0.05979538, 0., ..., 0., 0.], dtype=float32),
  "mention_dense_embeds": array([-7.14258850e-02, ..., -4.03847933e-01,],dtype=float32)
}
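The sparse and dense vectors can be combined into a single similarity score between a mention and a dictionary name. The sketch below assumes that the hybrid score (compare --score_mode hybrid in the evaluation command) is the dense inner product plus a scalar weight times the sparse inner product. The function names and the `sparse_weight` argument are hypothetical, and the weight value must come from a trained model.

```python
def inner(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def hybrid_score(m_sparse, m_dense, n_sparse, n_dense, sparse_weight):
    """Similarity between a mention (m_*) and a dictionary name (n_*).

    Illustrative sketch: assumes score = dense + sparse_weight * sparse.
    """
    sparse_score = inner(m_sparse, n_sparse)
    dense_score = inner(m_dense, n_dense)
    return sparse_weight * sparse_score + dense_score
```

Setting `sparse_weight` to 0 reduces this to a purely dense score.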

Demo

How to run the web demo

The web demo is implemented with the Tornado framework. If a dictionary is not yet cached, creating the dictionary cache will take a couple of minutes.

MODEL_NAME_OR_PATH=dmis-lab/biosyn-biobert-ncbi-disease

python demo.py \
  --model_name_or_path ${MODEL_NAME_OR_PATH} \
  --use_cuda \
  --dictionary_path ./datasets/ncbi-disease/test_dictionary.txt

Citations

@inproceedings{sung2020biomedical,
    title={Biomedical Entity Representations with Synonym Marginalization},
    author={Sung, Mujeen and Jeon, Hwisang and Lee, Jinhyuk and Kang, Jaewoo},
    booktitle={ACL},
    year={2020},
}
