• Stars: 167
• Rank: 226,635 (Top 5%)
• Language: Python
• License: MIT License
• Created: over 3 years ago
• Updated: over 1 year ago

Repository Details

[NAACL'21 & ACL'21] SapBERT: Self-alignment pretraining for BERT & XL-BEL: Cross-Lingual Biomedical Entity Linking.

SapBERT: Self-alignment pretraining for BERT

[news | 22 Aug 2021] SapBERT is integrated into NVIDIA's deep learning toolkit NeMo as its entity linking module (thank you, NVIDIA!). You can play with it in this Google Colab.


This repo holds code, data, and pretrained weights for (1) the SapBERT model presented in our NAACL 2021 paper: Self-Alignment Pretraining for Biomedical Entity Representations; (2) the cross-lingual SapBERT and a cross-lingual biomedical entity linking benchmark (XL-BEL) proposed in our ACL 2021 paper: Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking.

Hugging Face Models

English Models: [SapBERT] and [SapBERT-mean-token]

Standard SapBERT as described in [Liu et al., NAACL 2021]. Trained with UMLS 2020AA (English only), using microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext as the base model. For [SapBERT], use [CLS] (before pooler) as the representation of the input; for [SapBERT-mean-token], use mean-pooling across all tokens.
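For the mean-token variant, mean pooling replaces the [CLS] vector; a minimal sketch (the checkpoint name below is assumed from the naming pattern of the CLS variant, so double-check it on the hub):

import torch
from transformers import AutoTokenizer, AutoModel

# assumed checkpoint name for the mean-token variant
model_name = "cambridgeltl/SapBERT-from-PubMedBERT-fulltext-mean-token"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

toks = tokenizer(["covid-19"], padding=True, return_tensors="pt")
with torch.no_grad():
    hidden = model(**toks).last_hidden_state  # (batch, seq_len, dim)
mask = toks["attention_mask"].unsqueeze(-1)   # 1 for real tokens, 0 for padding
emb = (hidden * mask).sum(1) / mask.sum(1)    # mean over non-padding tokens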

Cross-Lingual Models: [SapBERT-XLMR] and [SapBERT-XLMR-large]

Cross-lingual SapBERT as described in [Liu et al., ACL 2021]. Trained with UMLS 2020AB (all languages), using xlm-roberta-base/xlm-roberta-large as the base model. Use [CLS] (before pooler) as the representation of the input.
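Loading the cross-lingual variants follows the same pattern as the English models; a minimal sketch (the hub checkpoint name below is an assumption, so verify it against the cambridgeltl model listing):

from transformers import AutoTokenizer, AutoModel

# assumed hub name for the base cross-lingual checkpoint
tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR")
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR")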

Environment

The code is tested with Python 3.8, PyTorch 1.7.0, and Hugging Face Transformers 4.4.2. Please see requirements.txt for more details.
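For a quick setup, installing the pinned versions should be enough (a sketch; numpy and tqdm are assumed because the inference script below uses them):

>> pip install torch==1.7.0 transformers==4.4.2 numpy tqdm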

Embedding Extraction with SapBERT

The following script converts a list of strings (entity names) into embeddings.

import numpy as np
import torch
from tqdm.auto import tqdm
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext")
model = AutoModel.from_pretrained("cambridgeltl/SapBERT-from-PubMedBERT-fulltext").cuda()

# replace with your own list of entity names
all_names = ["covid-19", "Coronavirus infection", "high fever", "Tumor of posterior wall of oropharynx"]

bs = 128  # batch size during inference
all_embs = []
for i in tqdm(np.arange(0, len(all_names), bs)):
    toks = tokenizer.batch_encode_plus(all_names[i:i+bs],
                                       padding="max_length",
                                       max_length=25,
                                       truncation=True,
                                       return_tensors="pt")
    toks_cuda = {k: v.cuda() for k, v in toks.items()}  # move the batch to GPU
    with torch.no_grad():  # no gradients needed at inference time
        cls_rep = model(**toks_cuda)[0][:, 0, :]  # use the [CLS] representation as the embedding
    all_embs.append(cls_rep.cpu().numpy())

all_embs = np.concatenate(all_embs, axis=0)

Please see inference/inference_on_snomed.ipynb for a more extensive inference example.
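As a toy end-to-end illustration, these embeddings can back a simple nearest-neighbour linker; a sketch reusing model, tokenizer, all_names, and all_embs from the script above (the query string is made up):

import torch

query = "coronavirus disease 2019"  # hypothetical query mention
toks = tokenizer(query, return_tensors="pt")
with torch.no_grad():
    q_emb = model(**{k: v.cuda() for k, v in toks.items()})[0][:, 0, :].cpu().numpy()
scores = q_emb @ all_embs.T         # dot-product similarity against all dictionary names
print(all_names[scores.argmax()])   # nearest entity name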

Train SapBERT

Extract training data from UMLS as instructed in training_data/generate_pretraining_data.ipynb (we cannot directly release the training file due to licensing issues).

Run:

>> cd train/
>> ./pretrain.sh 0,1 

where 0,1 specifies the GPU devices.

For finetuning on your customised dataset, generate data in the format of

concept_id || entity_name_1 || entity_name_2
...

where entity_name_1 and entity_name_2 are a synonym pair (belonging to the same concept, concept_id) sampled from a given labelled dataset. If one concept is associated with multiple entity names in the dataset, you can traverse all the pairwise combinations, as sketched below.
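For example, enumerating all pairwise combinations of the names attached to one concept (a sketch with a hypothetical concept ID and synonym list):

from itertools import combinations

concept_id = "C0000001"  # hypothetical concept ID
names = ["myocardial infarction", "heart attack", "MI"]  # hypothetical synonyms

with open("train_pairs.txt", "w") as f:
    for name_1, name_2 in combinations(names, 2):
        f.write(f"{concept_id} || {name_1} || {name_2}\n")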

For cross-lingual SAP-tuning with general-domain parallel data (MUSE, Wikipedia titles, or both), the data can be found in training_data/general_domain_parallel_data/. An example script: train/xling_train.sh.

Evaluate SapBERT

For evaluation (both monolingual and cross-lingual), please see evaluation/README.md for details. evaluation/xl_bel/ contains the XL-BEL benchmark proposed in [Liu et al., ACL 2021].

Citations

SapBERT:

@inproceedings{liu2021self,
  title={Self-Alignment Pretraining for Biomedical Entity Representations},
  author={Liu, Fangyu and Shareghi, Ehsan and Meng, Zaiqiao and Basaldella, Marco and Collier, Nigel},
  booktitle={Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies},
  pages={4228--4238},
  month={jun},
  year={2021}
}

Cross-lingual SapBERT and XL-BEL:

@inproceedings{liu2021learning,
  title={Learning Domain-Specialised Representations for Cross-Lingual Biomedical Entity Linking},
  author={Liu, Fangyu and Vuli{\'c}, Ivan and Korhonen, Anna and Collier, Nigel},
  booktitle={Proceedings of ACL-IJCNLP 2021},
  pages={565--574},
  month={aug},
  year={2021}
}

Acknowledgement

Parts of the code are modified from BioSyn. We thank the authors for open-sourcing BioSyn.

License

SapBERT is MIT licensed. See the LICENSE file for details.

More Repositories

1. visual-med-alpaca (Python, 358 stars): Visual Med-Alpaca is an open-source, multi-modal foundation model designed specifically for the biomedical domain, built on LLaMA-7B.
2. MTL-Bioinformatics-2016 (Python, 223 stars)
3. BioNLP-2016 (Python, 121 stars)
4. xcopa (97 stars): XCOPA: A Multilingual Dataset for Causal Commonsense Reasoning.
5. visual-spatial-reasoning (Python, 92 stars): [TACL'23] VSR: A probing benchmark for spatial understanding of vision-language models.
6. mirror-bert (Python, 75 stars): [EMNLP'21] Mirror-BERT: Converting pretrained language models to universal text encoders without labels.
7. composable-sft (Python, 68 stars): A library for parameter-efficient and composable transfer learning for NLP with sparse fine-tunings.
8. cometa (Jupyter Notebook, 46 stars): Corpus of Online Medical EnTities: the cometA corpus.
9. autopeft (Python, 42 stars): AutoPEFT: Automatic Configuration Search for Parameter-Efficient Fine-Tuning (Zhou et al.; TACL).
10. parameter-factorization (Python, 39 stars): Factorization of the neural parameter space for zero-shot multi-lingual and multi-task transfer.
11. ContrastiveBLI (Python, 32 stars): Improving Word Translation via Two-Stage Contrastive Learning (ACL 2022). Keywords: Bilingual Lexicon Induction, Word Translation, Cross-Lingual Word Embeddings.
12. PairS (Python, 32 stars): Aligning with Human Judgement: The Role of Pairwise Preference in Large Language Model Evaluators (Liu et al.; arXiv preprint arXiv:2403.16950).
13. link-prediction_with_deep-learning (Python, 28 stars)
14. eva (Python, 25 stars): [AAAI'21] Code release for "Visual Pivoting for (Unsupervised) Entity Alignment".
15. mop (Python, 24 stars): Code for the paper "Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT".
16. python4cl (Jupyter Notebook, 23 stars): Introductory Python course for computational linguistics.
17. adversarial-postspec (Python, 23 stars): Auxiliary GAN for word-embedding post-specialisation.
18. ClaPS (Python, 16 stars): Survival of the Most Influential Prompts: Efficient Black-Box Prompt Search via Clustering and Pruning (Zhou et al.; EMNLP 2023 Findings).
19. SIPHS (15 stars)
20. ACL2022_tutorial_multilingual_dialogue (14 stars): Materials for the "Natural Language Processing for Multilingual Task-Oriented Dialogue" tutorial at ACL 2022.
21. multi3woz (Python, 14 stars): The official repository for Multi3WOZ: A Multilingual, Multi-Domain, Multi-Parallel Dataset for Training and Evaluating Culturally Adapted Task-Oriented Dialog Systems (TACL 2023).
22. BLICEr (Python, 13 stars): Improving Bilingual Lexicon Induction with Cross-Encoder Reranking (Findings of EMNLP 2022). Keywords: Bilingual Lexicon Induction, Word Translation, Cross-Lingual Word Embeddings.
23. ECNMT (Python, 13 stars): Emergent Communication Pretraining for Few-Shot Machine Translation.
24. multilabel-nn (Python, 12 stars): Initializing neural networks for hierarchical multi-label text classification.
25. medlama (Python, 12 stars)
26. post-specialisation (Python, 12 stars): Post-Specialisation: Retrofitting Vectors of Words Unseen in Lexical Resources.
27. MirrorWiC (Python, 11 stars): [CoNLL'21] MirrorWiC: On Eliciting Word-in-Context Representations from Pretrained Language Models.
28. e2e_tod_toolkit (Python, 10 stars): A codebase for an end-to-end task-oriented dialogue (ToD) toolkit.
29. sw_study (Roff, 9 stars)
30. nn_for_LBD (Python, 9 stars): Repository for the paper "Neural networks for open and closed Literature-based Discovery".
31. chat (Python, 9 stars)
32. lionlbd (JavaScript, 9 stars): Source code for the LION LBD tool.
33. prompt4bli (Python, 9 stars): On Bilingual Lexicon Induction with Large Language Models (EMNLP 2023). Keywords: Bilingual Lexicon Induction, Word Translation, Large Language Models, LLMs.
34. zepo (Python, 8 stars): Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments (Zhou et al.).
35. cancer-hallmark-cnn (Python, 7 stars): Cancer hallmark CNN.
36. HELIN (Python, 7 stars): Demo entity linking API for the HDR Text Analytic Team.
37. COD (6 stars)
38. iso-study (Python, 6 stars): Data sets and comparable Wikipedia samples used in our study on near-isomorphism between monolingual word embeddings.
39. hyperlex (5 stars): HyperLex: a gold-standard resource for measuring and evaluating how well semantic models capture graded or soft lexical entailment.
40. ensembled-sicl (Python, 4 stars)
41. RepEval-2016 (Python, 4 stars)
42. POSQA (Python, 4 stars): Official repo of the EMNLP 2023 Findings paper "POSQA: Probe the World Models of LLMs with Size Comparisons".
43. bio-verbnet (4 stars): Contains materials for BioVerbNet.
44. panlex-bli (4 stars): Bilingual lexicon induction (BLI) training and test sets extracted from PanLex, used in the work of Vulić et al. (EMNLP 2019).
45. bio-simverb (Python, 4 stars)
46. bioverbnet (3 stars): BioVerbNet: a large semantic-syntactic classification of verbs in biomedicine.
47. retrofitted-bio-embeddings (Python, 3 stars): Bio word embeddings retrofitted to verb clusters.
48. mling_sdgms (Python, 3 stars)
49. response_reranking (Python, 3 stars): Code repository for "Reranking Overgenerated Responses for End-to-End Task-Oriented Dialogue Systems" (LREC-COLING 2024).
50. sqatin (Python, 2 stars): Code for the paper "SQATIN: Supervised Instruction Tuning Meets Question Answering for Improved Dialogue NLU", published at NAACL 2024 (main conference).
51. fs-wrep (2 stars): Pretrained function-specific vectors (Gerz et al., ACL 2020).
52. xling-postspec (Python, 2 stars): Cross-lingual Semantic Specialization via Lexical Relation Induction.
53. biocaster_2021 (Java, 2 stars): Public repo for the code and resources of BioCaster 2021: http://www.biocaster.org
54. sail-bli (Python, 1 star): Self-Augmented In-Context Learning for Unsupervised Word Translation (ACL 2024). Keywords: Bilingual Lexicon Induction, Word Translation, Large Language Models, LLMs.
55. bmip-2017-practical (1 star): BMIP 2017 practical.
56. deductive_reasoning_probing (Jupyter Notebook, 1 star)
57. uniprotidmap (Python, 1 star): UniProt ID mappings.
58. bmip-2018 (Python, 1 star): Resources for the BMIP ticked practical.