PL-Marker
Source code of the ACL 2022 paper "Packed Levitated Marker for Entity and Relation Extraction".
@tomaarsen develops a feasible and powerful library, SpanMarker, to apply PL-Marker to named entity recognition. SpanMarker takes advantage of the valuable functionality of 🤗 Transformers library, thus it can deal with a series of NER datasets with various encoders. Thanks to @tomaarsen for his insightful work!
Quick links
Overview
In this work, we present a novel span representation approach, named Packed Levitated Markers, to consider the dependencies between the spans (pairs) by strategically packing the markers in the encoder. Our approach is evaluated on two typical span (pair) representation tasks:
-
Named Entity Recognition (NER): Adopt a group packing strategy for enabling our model to process massive spans together to consider their dependencies with limited resources.
-
Relation Extraction (RE): Adopt a subject-oriented packing strategy for packing each subject and all its objects into an instance to model the dependencies between the same-subject span pairs
Please find more details of this work in our paper.
Setup
Install Dependencies
The code is based on huggaface's transformers.
Install dependencies and apex:
pip3 install -r requirement.txt
pip3 install --editable ./transformers
Download and preprocess the datasets
Our experiments are based on three datasets: ACE04, ACE05, and SciERC. Please find the links and pre-processing below:
- CoNLL03: We use the Enlish part of CoNLL03 from Google Drive/Tsinghua Cloud.
- OntoNotes: We use
preprocess_ontonotes.py
to preprocess the OntoNote 5.0. - Few-NERD: The dataseet can be downloaed in their website
- ACE04/ACE05: We use the preprocessing code from DyGIE repo. Please follow the instructions to preprocess the ACE05 and ACE04 datasets.
- SciERC: The preprocessed SciERC dataset can be downloaded from Google Drive/Tsinghua Cloud.
Input data format
The input data format for our models is JSONL. Each line of the input file contains one document in the following format.
{
# document ID (please make sure doc_key can be used to identify a certain document)
"doc_key": "CNN_ENG_20030306_083604.6",
# sentences in the document, each sentence is a list of tokens
"sentences": [
[...],
[...],
["tens", "of", "thousands", "of", "college", ...],
...
],
# entities (boundaries and entity type) in each sentence
"ner": [
[...],
[...],
[[26, 26, "LOC"], [14, 14, "PER"], ...], #the boundary positions are indexed in the document level
...,
],
# relations (two spans and relation type) in each sentence
"relations": [
[...],
[...],
[[14, 14, 10, 10, "ORG-AFF"], [14, 14, 12, 13, "ORG-AFF"], ...],
...
]
}
Trained Models
We release our trained NER models and RE models on ACE05 and SciERC datasets on Google Drive/Tsinghua Cloud. And we release our trained models on flat NER datasets on Tsinghua Cloud.
Note: the performance of the models might be slightly different from the reported numbers in the paper, since we reported the average numbers based on multiple runs.
Training Script
Download Pre-trained Language Models from Hugging Face:
mkdir -p bert_models/bert-base-uncased
wget -P bert_models/bert-base-uncased https://huggingface.co/bert-base-uncased/resolve/main/pytorch_model.bin
wget -P bert_models/bert-base-uncased https://huggingface.co/bert-base-uncased/resolve/main/vocab.txt
wget -P bert_models/bert-base-uncased https://huggingface.co/bert-base-uncased/resolve/main/config.json
mkdir -p bert_models/roberta-large
wget -P bert_models/roberta-large https://huggingface.co/roberta-large/resolve/main/pytorch_model.bin
wget -P bert_models/roberta-large https://huggingface.co/roberta-large/resolve/main/merges.txt
wget -P bert_models/roberta-large https://huggingface.co/roberta-large/resolve/main/vocab.json
wget -P bert_models/roberta-large https://huggingface.co/roberta-large/resolve/main/config.json
mkdir -p bert_models/albert-xxlarge-v1
wget -P bert_models/albert-xxlarge-v1 https://huggingface.co/albert-xxlarge-v1/resolve/main/pytorch_model.bin
wget -P bert_models/albert-xxlarge-v1 https://huggingface.co/albert-xxlarge-v1/resolve/main/spiece.model
wget -P bert_models/albert-xxlarge-v1 https://huggingface.co/albert-xxlarge-v1/resolve/main/config.json
wget -P bert_models/albert-xxlarge-v1 https://huggingface.co/albert-xxlarge-v1/resolve/main/tokenizer.json
mkdir -p bert_models/scibert_scivocab_uncased
wget -P bert_models/scibert_scivocab_uncased https://huggingface.co/allenai/scibert_scivocab_uncased/resolve/main/pytorch_model.bin
wget -P bert_models/scibert_scivocab_uncased https://huggingface.co/allenai/scibert_scivocab_uncased/resolve/main/vocab.txt
wget -P bert_models/scibert_scivocab_uncased https://huggingface.co/allenai/scibert_scivocab_uncased/resolve/main/config.json
Train NER Models:
bash scripts/run_train_ner_PLMarker.sh
bash scripts/run_train_ner_BIO.sh
bash scripts/run_train_ner_TokenCat.sh
Train RE Models:
bash run_train_re.sh
Quick Start
The following commands can be used to run our pre-trained models on SciERC.
Evaluate the NER model:
CUDA_VISIBLE_DEVICES=0 python3 run_acener.py --model_type bertspanmarker \
--model_name_or_path ../bert_models/scibert-uncased --do_lower_case \
--data_dir scierc \
--learning_rate 2e-5 --num_train_epochs 50 --per_gpu_train_batch_size 8 --per_gpu_eval_batch_size 16 --gradient_accumulation_steps 1 \
--max_seq_length 512 --save_steps 2000 --max_pair_length 256 --max_mention_ori_length 8 \
--do_eval --evaluate_during_training --eval_all_checkpoints \
--fp16 --seed 42 --onedropout --lminit \
--train_file train.json --dev_file dev.json --test_file test.json \
--output_dir sciner_models/sciner-scibert --overwrite_output_dir --output_results
We need the ner result ent_pred_test.json
from the NER model with --output_results
. Then we evaluate the RE model:
CUDA_VISIBLE_DEVICES=0 python3 run_re.py --model_type bertsub \
--model_name_or_path ../bert_models/scibert-uncased --do_lower_case \
--data_dir scierc \
--learning_rate 2e-5 --num_train_epochs 10 --per_gpu_train_batch_size 8 --per_gpu_eval_batch_size 16 --gradient_accumulation_steps 1 \
--max_seq_length 256 --max_pair_length 16 --save_steps 2500 \
--do_eval --evaluate_during_training --eval_all_checkpoints --eval_logsoftmax \
--fp16 \
--test_file sciner_models/sciner-scibert/ent_pred_test.json \
--use_ner_results \
--output_dir scire_models/scire-scibert
Here, --use_ner_results
denotes using the original entity type predicted by NER models.
CoNLL03 with Dev
Model | Original split | Train with dev set |
---|---|---|
SeqTagger | 93.6 | 93.9 |
T-Concat | 93.0 | 93.3 |
PL-Marker | 94.0 | 94.2 |
Citation
If you use our code in your research, please cite our work:
@inproceedings{ye2022plmarker,
author = {Deming Ye and
Yankai Lin and
Peng Li and
Maosong Sun},
editor = {Smaranda Muresan and
Preslav Nakov and
Aline Villavicencio},
title = {Packed Levitated Marker for Entity and Relation Extraction},
booktitle = {Proceedings of the 60th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), {ACL} 2022, Dublin, Ireland,
May 22-27, 2022},
pages = {4904--4917},
publisher = {Association for Computational Linguistics},
year = {2022},
url = {https://aclanthology.org/2022.acl-long.337},
timestamp = {Wed, 18 May 2022 15:21:43 +0200},
biburl = {https://dblp.org/rec/conf/acl/YeL0S22.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}