  • Stars: 321
  • Rank: 130,793 (Top 3%)
  • Language: Python
  • License: MIT License
  • Created: over 4 years ago
  • Updated: about 2 years ago

Repository Details

[EMNLP 2020] OpenUE: An Open Toolkit of Universal Extraction from Text

中文说明 | English

OpenUE is a lightweight toolkit for knowledge graph extraction.

GitHub Documentation

Features

  • Knowledge graph extraction tasks based on pre-trained language models (compatible with BERT, RoBERTa, and other pre-trained models)
    • Entity and relation extraction
    • Event extraction
    • Slot filling and intent detection
    • More tasks
  • Training and testing interfaces
  • Fast deployment of NLP models

Environment

  • python3.8
  • requirements.txt

Architecture

(figure: framework diagram)

The toolkit is organized into three main modules: models, lit_models, and data.

models module

This module holds our three main models: a relation classification model that works on the whole sentence, a named entity recognition model conditioned on the relations already identified in the sentence, and an inference model that chains the two together for validation. They are mainly built from the pre-trained models already defined in the transformers library.
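As a rough illustration of how these components map onto transformers classes, the sketch below instantiates a sentence-level relation classifier and a token-level entity recognizer from pre-trained weights; the auto classes and label counts are assumptions for illustration, not OpenUE's exact model definitions.

# Illustrative sketch only: OpenUE's actual model classes may differ.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,  # sentence-level relation classification
    AutoModelForTokenClassification,     # relation-conditioned entity recognition
)

pretrained = "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(pretrained)

# num_labels values are placeholders; they depend on the dataset's relation and entity schema.
rel_model = AutoModelForSequenceClassification.from_pretrained(pretrained, num_labels=49)
ner_model = AutoModelForTokenClassification.from_pretrained(pretrained, num_labels=7)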

lit_models module

The code here mainly builds on pytorch_lightning.Trainer, which automatically sets up model training on different hardware such as single GPU, multi-GPU, and TPU. We only need to define training_step and validation_step, and the training logic is assembled automatically.

Because it is hardware-agnostic, the OpenUE training module can be invoked from many different environments.
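A minimal sketch of that pattern, assuming a generic transformers model whose forward pass returns a loss; the class name and optimizer settings are placeholders, not OpenUE's actual lit_models code.

# Minimal PyTorch Lightning pattern (illustrative; not OpenUE's exact classes).
import pytorch_lightning as pl
import torch

class LitExtractionModel(pl.LightningModule):
    def __init__(self, model):
        super().__init__()
        self.model = model  # e.g. a transformers model returning an output with .loss

    def training_step(self, batch, batch_idx):
        loss = self.model(**batch).loss
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        self.log("val_loss", self.model(**batch).loss)

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=2e-5)

# The Trainer hides the hardware details (single GPU, multi-GPU, TPU, ...):
# trainer = pl.Trainer(max_epochs=3)
# trainer.fit(LitExtractionModel(rel_model), train_dataloader, val_dataloader)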

data module

The data module holds the code that handles each dataset's specific processing. It uses the tokenizer from the transformers library to tokenize the raw text first, and then converts the data into the features required by each task.
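A hedged sketch of that tokenize-then-featurize flow; the function and field names below are placeholders rather than OpenUE's actual data code.

# Illustrative only: field names and padding choices are placeholders.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")

def build_features(example, max_length=256):
    """Tokenize one JSON record and keep the triple annotations alongside."""
    encoded = tokenizer(
        example["text"],
        truncation=True,
        max_length=max_length,
        padding="max_length",
    )
    encoded["spo_list"] = example["spo_list"]  # later converted into task-specific labels
    return encoded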

Quick Start

Installation

Anaconda environment

conda create -n openue python=3.8
conda activate openue
pip install -r requirements.txt
conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c nvidia # choose the cudatoolkit version that matches your NVIDIA driver
python setup.py install

Install via pip

pip install openue

Local development install via pip

python setup.py develop

Usage

The data format is a JSON file; a concrete example is shown below.

{
	"text": "查尔斯·阿兰基斯(Charles Aránguiz),1989年4月17日出生于智利圣地亚哥,智利职业足球运动员,司职中场,效力于德国足球甲级联赛勒沃库森足球俱乐部",
	"spo_list": [{
		"predicate": "出生地",
		"object_type": "地点",
		"subject_type": "人物",
		"object": "圣地亚哥",
		"subject": "查尔斯·阿兰基斯"
	}, {
		"predicate": "出生日期",
		"object_type": "Date",
		"subject_type": "人物",
		"object": "1989年4月17日",
		"subject": "查尔斯·阿兰基斯"
	}]
}
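To make the structure concrete, the hedged sketch below reads records in this format and prints their (subject, predicate, object) triples; the file path and the one-record-per-line assumption are placeholders that may not match the actual dataset layout.

# Hedged sketch: assumes one JSON record per line under a hypothetical ./dataset/train.json.
import json

with open("./dataset/train.json", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        for spo in record["spo_list"]:
            print(spo["subject"], spo["predicate"], spo["object"])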

Training the models

Place the data under the ./dataset/ directory and then start training. If the directory is empty, running the scripts below will automatically download the dataset and pre-trained models and start training; please keep your network connection stable so that the model and data downloads do not fail.

# train the NER (named entity recognition) module
./scripts/run_ner.sh
# train the SEQ (sentence-level relation classification) module
./scripts/run_seq.sh

The short demo below briefly shows the training process; only a single batch is trained to keep the demonstration quick. (figure: demo)

Evaluating the models

Because we use a pipeline model, the two components cannot be trained jointly; they are trained separately and then evaluated together. After running the two training scripts, two sets of model weights are written under the output path, output/ner/${dataset} and output/seq/${dataset}, organized by dataset. Pass the two weight directories as ner_model_name_or_path and seq_model_name_or_path in run_infer.yaml or the run_infer.sh script to run the evaluation.

Notebook Quick Start

The ske dataset training notebook uses the Chinese dataset as an example and walks through how to use lit_models, models, and data in openue, making it easy for users to build their own training logic.

Open in Colab: run in the Colab cloud environment, no local setup required.

Automatic hyperparameter tuning support (wandb)

# replace the logger in the training code with a wandb logger to enable wandb
import pytorch_lightning as pl

logger = pl.loggers.WandbLogger(project="openue")

English support

For English datasets, the only parameter that needs to change is model_name_or_path, i.e. the pre-trained language-model weights. Thanks to the broad compatibility of the transformers library, switching the Chinese pre-trained language model bert-base-chinese to an English one such as bert-base-uncased is all that is needed.
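With the transformers auto classes, the switch amounts to changing a single string, as in this illustrative sketch (not OpenUE's exact loading code):

from transformers import AutoModel, AutoTokenizer

# Only the pre-trained weight name changes for English data.
model_name_or_path = "bert-base-uncased"  # Chinese default: "bert-base-chinese"
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path)
model = AutoModel.from_pretrained(model_name_or_path)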

Fast model deployment

Download torchserve-docker

Docker download

Create the handler class for the model

We have already placed the corresponding deployment handlers, handler_seq.py and handler_ner.py, under the deploy folder.
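For readers unfamiliar with TorchServe, such a handler usually has the shape sketched below; this is an assumption-heavy illustration, and the real preprocessing and label decoding live in the repository's handler_seq.py and handler_ner.py.

# Hedged sketch of a TorchServe handler; not the repository's actual implementation.
import torch
from ts.torch_handler.base_handler import BaseHandler
from transformers import AutoTokenizer

class SeqClassificationHandler(BaseHandler):
    def initialize(self, context):
        super().initialize(context)  # loads the serialized model and selects the device
        model_dir = context.system_properties.get("model_dir")
        self.tokenizer = AutoTokenizer.from_pretrained(model_dir)

    def preprocess(self, requests):
        # each request carries raw text under "data" or "body"
        texts = [(req.get("data") or req.get("body")) for req in requests]
        texts = [t.decode("utf-8") if isinstance(t, (bytes, bytearray)) else t for t in texts]
        return self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)

    def inference(self, inputs):
        with torch.no_grad():
            return self.model(**inputs.to(self.device))

    def postprocess(self, outputs):
        # one class id per request; the real handler maps ids back to relation labels
        return outputs.logits.argmax(dim=-1).tolist()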

# Use torch-model-archiver to package the model files, where
# extra-files must include the following:
# config.json, setup_config.json: configuration for the model and for inference
# vocab.txt: the vocabulary used by the tokenizer
# model.py: the model code

torch-model-archiver --model-name BERTForNER_en  \
	--version 1.0 --serialized-file ./ner_en/pytorch_model.bin \
	--handler ./deploy/handler.py \
	--extra-files "./ner_en/config.json,./ner_en/setup_config.json,./ner_en/vocab.txt,./deploy/model.py" -f

# Copy the packaged .mar file into the model-store folder, then use curl to deploy it to the Docker container.
sudo cp ./BERTForSEQ_en.mar /home/model-server/model-store/
curl -v -X POST "http://localhost:3001/models?initial_workers=1&synchronous=false&url=BERTForSEQ_en.mar&batch_size=1&max_batch_delay=200"

Project Members

Zhejiang University: 张宁豫, 谢辛, 毕祯, 王泽元, 陈想, 余海阳, 邓淑敏, 叶宏彬, 田玺, 郑国轴, 陈华钧

DAMO Academy: 陈漠沙, 谭传奇, 黄非


Citation

If you use or extend our work, please cite the following paper:

@inproceedings{DBLP:conf/emnlp/ZhangDBYYCHZC20,
  author    = {Ningyu Zhang and
               Shumin Deng and
               Zhen Bi and
               Haiyang Yu and
               Jiacheng Yang and
               Mosha Chen and
               Fei Huang and
               Wei Zhang and
               Huajun Chen},
  editor    = {Qun Liu and
               David Schlangen},
  title     = {OpenUE: An Open Toolkit of Universal Extraction from Text},
  booktitle = {Proceedings of the 2020 Conference on Empirical Methods in Natural
               Language Processing: System Demonstrations, {EMNLP} 2020 - Demos,
               Online, November 16-20, 2020},
  pages     = {1--8},
  publisher = {Association for Computational Linguistics},
  year      = {2020},
  url       = {https://doi.org/10.18653/v1/2020.emnlp-demos.1},
  doi       = {10.18653/v1/2020.emnlp-demos.1},
  timestamp = {Wed, 08 Sep 2021 16:17:48 +0200},
  biburl    = {https://dblp.org/rec/conf/emnlp/ZhangDBYYCHZC20.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

Other Open-source Knowledge Extraction Tools

More Repositories

1. DeepKE - [EMNLP 2022] An Open Toolkit for Knowledge Graph Extraction and Construction (Python, 3,517 stars)
2. EasyEdit - [ACL 2024] An Easy-to-use Knowledge Editing Framework for LLMs. (Jupyter Notebook, 1,815 stars)
3. LLMAgentPapers - Must-read Papers on LLM Agents. (1,683 stars)
4. KnowLM - An Open-sourced Knowledgable Large Language Model Framework. (Python, 1,209 stars)
5. Prompt4ReasoningPapers - [ACL 2023] Reasoning with Language Model Prompting: A Survey (863 stars)
6. KnowledgeEditingPapers - Must-read Papers on Knowledge Editing for Large Language Models. (843 stars)
7. PromptKG - PromptKG Family: a Gallery of Prompt Learning & KG-related research works, toolkits, and paper-list. (Python, 690 stars)
8. EasyInstruct - [ACL 2024] An Easy-to-use Instruction Processing Framework for LLMs. (Python, 357 stars)
9. AutoKG - LLMs for Knowledge Graph Construction and Reasoning: Recent Capabilities and Future Opportunities (Python, 345 stars)
10. Mol-Instructions - [ICLR 2024] Mol-Instructions: A Large-Scale Biomolecular Instruction Dataset for Large Language Models (Python, 233 stars)
11. KnowPrompt - [WWW 2022] KnowPrompt: Knowledge-aware Prompt-tuning with Synergistic Optimization for Relation Extraction (Python, 194 stars)
12. MKGformer - [SIGIR 2022] Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion (Python, 167 stars)
13. KnowAgent - KnowAgent: Knowledge-Augmented Planning for LLM-Based Agents (Python, 163 stars)
14. AutoAct - [ACL 2024] AUTOACT: Automatic Agent Learning from Scratch for QA via Self-Planning (Python, 162 stars)
15. IEPile - [ACL 2024] IEPile: A Large-Scale Information Extraction Corpus (Python, 158 stars)
16. OntoProtein - [ICLR 2022] OntoProtein: Protein Pretraining With Gene Ontology Embedding (Python, 141 stars)
17. DocuNet - [IJCAI 2021] Document-level Relation Extraction as Semantic Segmentation (Python, 130 stars)
18. DART - [ICLR 2022] Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners (Python, 127 stars)
19. MolGen - [ICLR 2024] Domain-Agnostic Molecular Generation with Chemical Feedback (Python, 124 stars)
20. Relphormer - [Neurocomputing 2023] Relational Graph Transformer for Knowledge Graph Representation (Python, 120 stars)
21. Low-resource-KEPapers - A Paper List of Low-resource Information Extraction (114 stars)
22. OneGen - [EMNLP 2024 Findings] OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs. (Python, 114 stars)
23. Generative_KG_Construction_Papers - [EMNLP 2022] Generative Knowledge Graph Construction: A Review (104 stars)
24. HVPNeT - [NAACL 2022 Findings] Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction (Python, 97 stars)
25. MachineSoM - [ACL 2024] Exploring Collaboration Mechanisms for LLM Agents: A Social Psychology View (Python, 91 stars)
26. MKG_Analogy - [ICLR 2023] Multimodal Analogical Reasoning over Knowledge Graphs (Python, 89 stars)
27. FactCHD - [IJCAI 2024] FactCHD: Benchmarking Fact-Conflicting Hallucination Detection (Python, 78 stars)
28. NLP4SciencePapers - Must-read papers on NLP for science. (53 stars)
29. KNN-KG - [NLPCC 2023] Reasoning Through Memorization: Nearest Neighbor Knowledge Graph Embeddings with Language Models (Python, 49 stars)
30. KnowledgeCircuits - Knowledge Circuits in Pretrained Transformers (Python, 47 stars)
31. ChatCell - ChatCell: Facilitating Single-Cell Analysis with Natural Language (Python, 42 stars)
32. RAP - [SIGIR 2023] Schema-aware Reference as Prompt Improves Data-Efficient Knowledge Graph Construction (Python, 39 stars)
33. DeepEE - DeepEE: Deep Event Extraction Algorithm Gallery (an open-source collection of deep-learning-based Chinese event extraction algorithms) (Python, 39 stars)
34. TRICE - [NAACL 2024] Making Language Models Better Tool Learners with Execution Feedback (Python, 36 stars)
35. DocED - [ACL 2021] MLBiNet: A Cross-Sentence Collective Event Detection Network (Python, 35 stars)
36. Kformer - [NLPCC 2022] Kformer: Knowledge Injection in Transformer Feed-Forward Layers (Python, 34 stars)
37. LREBench - [EMNLP 2022 Findings] Towards Realistic Low-resource Relation Extraction: A Benchmark with Empirical Baseline Study (Python, 33 stars)
38. ContinueMKGC - [IJCAI 2024] Continual Multimodal Knowledge Graph Construction (Python, 32 stars)
39. IEDatasetZoo - Information Extraction Dataset Zoo. (31 stars)
40. WKM - Agent Planning with World Knowledge Model (Python, 30 stars)
41. DiagnoseRE - [CCKS 2021] On Robustness and Bias Analysis of BERT-based Relation Extraction (Python, 27 stars)
42. OceanGPT - [ACL 2024] OceanGPT: A Large Language Model for Ocean Science Tasks (25 stars)
43. PitfallsKnowledgeEditing - [ICLR 2024] Unveiling the Pitfalls of Knowledge Editing for Large Language Models (Python, 22 stars)
44. AdaKGC - [EMNLP 2023 (Findings)] Schema-adaptable Knowledge Graph Construction (Python, 17 stars)
45. knowledge-rumination - [EMNLP 2023] Knowledge Rumination for Pre-trained Language Models (Python, 16 stars)
46. KnowUnDo - [EMNLP 2024 Findings] To Forget or Not? Towards Practical Knowledge Unlearning for Large Language Models (Python, 16 stars)
47. EasyDetect - [ACL 2024] An Easy-to-use Hallucination Detection Framework for LLMs. (Python, 16 stars)
48. OneEdit - OneEdit: A Neural-Symbolic Collaboratively Knowledge Editing System. (Python, 15 stars)
49. SPEECH - [ACL 2023] SPEECH: Structured Prediction with Energy-Based Event-Centric Hyperspheres (Python, 13 stars)
50. NLPCC2024_RegulatingLLM - [NLPCC 2024] Shared Task 10: Regulating Large Language Models (13 stars)
51. SemEval2021Task4 - The 4th rank system of the SemEval 2021 Task4. (Python, 10 stars)
52. Revisit-KNN - [CCL 2023] Revisiting k-NN for Fine-tuning Pre-trained Language Models (Python, 10 stars)
53. EasyEval - An Easy-to-use Intelligence Evaluation Framework for LLMs. (Python, 6 stars)
54. BiasEdit - Debiasing Stereotyped Language Models via Model Editing (Python, 5 stars)
55. zjunlp.github.io (HTML, 3 stars)
56. project - Project homepages for the NLP & KG Group of Zhejiang University (JavaScript, 3 stars)
57. DQSetGen - [TASLP 2024] Sequence Labeling as Non-autoregressive Dual-Query Set Generation (Python, 3 stars)
58. L2A (Python, 2 stars)
59. KnowFM (2 stars)
60. EditBias - EditBias: Debiasing Stereotyped Language Models via Model Editing (Python, 1 star)