Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter
Code and checkpoints for the ACL 2021 paper "Lexicon Enhanced Chinese Sequence Labeling Using BERT Adapter".
arXiv link to the paper: https://arxiv.org/abs/2105.07148
If you have any questions, please contact: [email protected]
Requirements
- Python 3.7.0
- transformers 3.4.0
- Numpy 1.18.5
- Packaging 17.1
- scikit-learn 0.23.2
- torch 1.6.0+cu92
- tqdm 4.50.2
- multiprocess 0.70.10
- tensorflow 2.3.1
- tensorboardX 2.1
- seqeval 1.2.1
Input Format
CoNLL format (BIOES tag scheme preferred), with one character and its label per line. Sentences are separated by a blank line.
美 B-LOC
国 E-LOC
的 O
华 B-PER
莱 I-PER
士 E-PER
我 O
跟 O
他 O
谈 O
笑 O
风 O
生 O
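Since BIOES is the preferred scheme, data tagged in plain BIO may need converting before training. The following is a minimal sketch of such a conversion, not code from this repository; the helper name `bio_to_bioes` is hypothetical.

```python
from typing import List


def bio_to_bioes(tags: List[str]) -> List[str]:
    """Convert the BIO tags of one sentence to BIOES (hypothetical helper)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, _, label = tag.partition("-")
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = nxt == "I-" + label
        if prefix == "B":
            out.append(("B-" if continues else "S-") + label)
        elif prefix == "I":
            out.append(("I-" if continues else "E-") + label)
        else:
            out.append(tag)  # already E-/S- or an unknown prefix: leave unchanged
    return out


if __name__ == "__main__":
    print(bio_to_bioes(["B-LOC", "I-LOC", "O", "B-PER", "I-PER", "I-PER", "O"]))
```

Single-character entities become `S-` tags, and the last character of a longer entity becomes `E-`, which matches the pattern in the example sentence above.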
Chinese BERT, Chinese Word Embedding, and Checkpoints
Chinese BERT
Chinese BERT: https://huggingface.co/bert-base-chinese/tree/main
Chinese word embedding:
Word Embedding: https://ai.tencent.com/ailab/nlp/en/data/Tencent_AILab_ChineseEmbedding.tar.gz
The original download link no longer works. The updated link is:
Word Embedding: https://ai.tencent.com/ailab/nlp/en/data/tencent-ailab-embedding-zh-d200-v0.2.0.tar.gz
For more information, see: Tencent AI Lab Word Embedding
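The embedding archive is large, so it can help to inspect only the first entries before using it. The sketch below assumes the extracted file is in word2vec text format (a header line `<count> <dim>`, then one word followed by its vector per line); that layout, the function name `read_embeddings`, and the `max_words` parameter are assumptions, not part of this repository.

```python
import os
import tempfile
from typing import Dict, List


def read_embeddings(path: str, max_words: int = 10000) -> Dict[str, List[float]]:
    """Read up to max_words vectors from a word2vec-style text file."""
    vectors: Dict[str, List[float]] = {}
    with open(path, encoding="utf-8") as f:
        header = f.readline().split()
        dim = int(header[1])  # assumed header layout: "<count> <dim>"
        for line in f:
            if len(vectors) >= max_words:
                break
            parts = line.rstrip().split(" ")
            if len(parts) != dim + 1:
                continue  # skip lines that do not match the declared dimension
            vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors


if __name__ == "__main__":
    # Tiny stand-in file so the sketch runs without the real download.
    sample = "2 3\n中国 0.1 0.2 0.3\n北京 0.4 0.5 0.6\n"
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
        tmp.write(sample)
    vecs = read_embeddings(tmp.name)
    print(list(vecs))
    os.remove(tmp.name)
```

Capping the number of words keeps memory use small while checking that the file decodes correctly.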
Checkpoints and Shells
- Weibo NER
- OntoNotes 4 NER
- MSRA NER
- Resume NER
- CTB5 POS
- CTB6 POS
- UD1 POS
- UD2 POS
- CTB6 CWS
- MSR CWS
- PKU CWS
Directory Structure of data
- berts
  - bert
    - config.json
    - vocab.txt
    - pytorch_model.bin
- dataset, you can download from here
  - NER
    - note4
    - msra
    - resume
  - POS
    - ctb5
    - ctb6
    - ud1
    - ud2
  - CWS
    - ctb6
    - msr
    - pku
- vocab
  - tencent_vocab.txt, the vocab of the pre-trained word embedding table, download from here.
- embedding
  - word_embedding.txt
- result
  - NER
    - note4
    - msra
    - resume
  - POS
    - ctb5
    - ctb6
    - ud1
    - ud2
  - CWS
    - ctb6
    - msr
    - pku
- log
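The empty directory skeleton can be created in one step before the files are copied in. This is a minimal sketch, assuming the nesting `dataset/<task>/<name>` and `result/<task>/<name>` for the entries listed above; the function name `make_skeleton` and the `root` argument are hypothetical.

```python
from pathlib import Path
from typing import List

# Dataset names per task, taken from the listing above.
TASKS = {
    "NER": ["note4", "msra", "resume"],
    "POS": ["ctb5", "ctb6", "ud1", "ud2"],
    "CWS": ["ctb6", "msr", "pku"],
}


def make_skeleton(root: str) -> List[Path]:
    """Create the (assumed) data directory layout under root and return the paths."""
    base = Path(root)
    dirs = [base / "berts" / "bert", base / "vocab", base / "embedding", base / "log"]
    for top in ("dataset", "result"):
        for task, names in TASKS.items():
            dirs.extend(base / top / task / name for name in names)
    for d in dirs:
        d.mkdir(parents=True, exist_ok=True)  # safe to re-run
    return dirs


if __name__ == "__main__":
    for path in make_skeleton("data"):
        print(path)
```

The BERT files, the vocab, and the embedding file still have to be downloaded and placed into these directories by hand.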
Run
1. Convert the .char.bmes files to .json files:
python3 to_json.py
2. Run the shell script:
sh run_demo.sh
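To illustrate what the conversion step involves, here is a minimal sketch that turns character-per-line input into one JSON object per sentence. It is not the repository's `to_json.py`; the output schema with `text` and `label` keys and the helper names are assumptions, so check the actual script for the exact format.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def read_sentences(lines: Iterable[str]) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (characters, labels) for each blank-line-separated sentence."""
    chars: List[str] = []
    labels: List[str] = []
    for line in lines:
        parts = line.split()
        if not parts:  # a blank line ends the current sentence
            if chars:
                yield chars, labels
                chars, labels = [], []
            continue
        chars.append(parts[0])
        labels.append(parts[-1])
    if chars:  # the last sentence may lack a trailing blank line
        yield chars, labels


def to_json_lines(lines: Iterable[str]) -> List[str]:
    """Serialize each sentence as one JSON line (assumed schema)."""
    return [
        json.dumps({"text": c, "label": l}, ensure_ascii=False)
        for c, l in read_sentences(lines)
    ]


if __name__ == "__main__":
    sample = ["美 B-LOC", "国 E-LOC", "的 O", "", "我 O"]
    for row in to_json_lines(sample):
        print(row)
```

With a real file, the lines can be passed directly from `open(path, encoding="utf-8")`.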
If you want to load my checkpoints, you need to make some revisions to your copy of transformers.
My models were trained in distributed mode, so they cannot be loaded directly in single-GPU mode. Follow the steps below to revise transformers before loading my checkpoints.
- Enter the source code directory of transformers:
cd source/transformers-master
- Find modeling_utils.py and go to around line 995.
- Build and install the revised source code:
python3 setup.py install
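For context, a PyTorch model wrapped in `DataParallel` or `DistributedDataParallel` saves its parameter names with a `module.` prefix, which a non-wrapped model does not expect. Whether that is the only mismatch in these checkpoints is an assumption, so the sketch below is a possible alternative to patching the library, not the author's method; the helper name `strip_module_prefix` is hypothetical.

```python
from collections import OrderedDict
from typing import Any, Mapping


def strip_module_prefix(state_dict: Mapping[str, Any], prefix: str = "module.") -> "OrderedDict[str, Any]":
    """Return a copy of state_dict with a leading prefix removed from each key."""
    cleaned: "OrderedDict[str, Any]" = OrderedDict()
    for key, value in state_dict.items():
        # Only strip at the start of the key; keys without the prefix are kept as-is.
        new_key = key[len(prefix):] if key.startswith(prefix) else key
        cleaned[new_key] = value
    return cleaned


if __name__ == "__main__":
    # Stand-in values; a real state dict maps names to tensors.
    demo = OrderedDict([("module.bert.embeddings.weight", 1), ("classifier.bias", 2)])
    print(list(strip_module_prefix(demo)))
```

With a real checkpoint, the input would come from `torch.load(path, map_location="cpu")` and the result would be passed to the model's `load_state_dict`.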
Cite
@inproceedings{liu-etal-2021-lexicon,
title = "Lexicon Enhanced {C}hinese Sequence Labeling Using {BERT} Adapter",
author = "Liu, Wei and
Fu, Xiyan and
Zhang, Yue and
Xiao, Wenming",
booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
month = aug,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.acl-long.454",
doi = "10.18653/v1/2021.acl-long.454",
pages = "5847--5858"
}