  • Stars: 338
  • Rank: 124,931 (Top 3%)
  • Language: Python
  • License: MIT License
  • Created: over 2 years ago
  • Updated: about 1 year ago

Repository Details

A comprehensive, unified and modular event extraction toolkit.


Overview

OmniEvent is a powerful open-source toolkit for event extraction, including event detection and event argument extraction. We comprehensively cover various paradigms and provide fair and unified evaluations on widely-used English and Chinese datasets. Modular implementations make OmniEvent highly extensible.

Highlights

  • Comprehensive Capability

    • Perform Event Extraction end-to-end, or run its two subtasks, Event Detection and Event Argument Extraction, independently.
    • Cover various paradigms: Token Classification, Sequence Labeling, MRC (QA), and Seq2Seq.
    • Implement Transformer-based (BERT, T5, etc.) and classical (DMCNN, CRF, etc.) models.
    • Both Chinese and English are supported for all event extraction subtasks, paradigms, and models.
  • Unified Benchmark & Evaluation

    • Various datasets are processed into a unified format.
    • Predictions of different paradigms are all converted into a unified candidate set for fair evaluations.
    • Four evaluation modes (gold, loose, default, strict) well cover different previous evaluation settings.
  • Modular Implementation

    • All models are decomposed into four modules:
      • Input Engineering: Prepare inputs and support various input engineering methods like prompting.
      • Backbone: Encode text into hidden states.
      • Aggregation: Fuse hidden states (e.g., select [CLS], pooling, GCN) to the final event representation.
      • Output Head: Map the event representation to the final outputs, such as Linear, CRF, MRC head, etc.
    • You can combine and reimplement different modules to design and implement your own new model (see the conceptual sketch after this list).
  • Big Model Training & Inference

    • Efficient training and inference of big event extraction models are supported with BMTrain.
  • Easy to Use & Highly Extensible

    • Open datasets can be downloaded and processed with a single command.
    • Fully compatible with 🤗 Transformers and its Trainer.
    • Users can easily reproduce existing models and build customized models with OmniEvent.
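
To make the modular design concrete, here is a minimal, self-contained sketch of how the four modules compose into a model. It is purely illustrative: the class and callables below are assumptions made for this example, not OmniEvent's actual internal API.

# A toy, hypothetical composition of the four modules; the names here are
# illustrative assumptions, not OmniEvent's real internal API.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class ToyEventModel:
    input_engineering: Callable[[str], List[str]]            # text -> model inputs
    backbone: Callable[[List[str]], List[List[float]]]       # inputs -> hidden states
    aggregation: Callable[[List[List[float]]], List[float]]  # hidden states -> event representation
    output_head: Callable[[List[float]], str]                # representation -> label

    def predict(self, text: str) -> str:
        tokens = self.input_engineering(text)
        hidden_states = self.backbone(tokens)
        event_repr = self.aggregation(hidden_states)
        return self.output_head(event_repr)

# Swapping any single stage (e.g. a CRF head for a linear head, or GCN
# aggregation for pooling) yields a new model without touching the rest.
model = ToyEventModel(
    input_engineering=str.split,                             # whitespace tokenization
    backbone=lambda toks: [[float(len(t))] for t in toks],   # stand-in encoder
    aggregation=lambda hs: hs[0],                            # "select first token"
    output_head=lambda h: "event" if h[0] > 5 else "none",
)
print(model.predict("troops pounded Baghdad at dawn"))       # -> "event"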

Installation

With pip

This repository has been tested on Python 3.9+ and PyTorch 1.12.1+. OmniEvent can be installed with pip as follows:

pip install OmniEvent

From source

If you want to install the repository from a local source checkout, install as follows:

pip install .

And if you want to edit the repository, install it in editable mode:

pip install -e .

Easy Start

OmniEvent provides several off-the-shelf models for users. Examples are shown below.

Make sure you have installed OmniEvent as instructed above. Note that it may take a few minutes to download the checkpoints the first time you run inference.

>>> from OmniEvent.infer import infer

>>> # Event Extraction (EE) Task
>>> text = "2022年北京市举办了冬奥会"
>>> results = infer(text=text, task="EE")
>>> print(results[0]["events"])
[
    {
        "type": "组织行为开幕", "trigger": "举办", "offset": [8, 10],
        "arguments": [
            {   "mention": "2022年", "offset": [9, 16], "role": "时间"},
            {   "mention": "北京市", "offset": [81, 89], "role": "地点"},
            {   "mention": "冬奥会", "offset": [0, 4], "role": "活动名称"},
        ]
    }
]

>>> text = "U.S. and British troops were moving on the strategic southern port city of Basra \ 
Saturday after a massive aerial assault pounded Baghdad at dawn"

>>> # Event Detection (ED) Task
>>> results = infer(text=text, task="ED")
>>> print(results[0]["events"])
[
    { "type": "attack", "trigger": "assault", "offset": [113, 120]},
    { "type": "injure", "trigger": "pounded", "offset": [121, 128]}
]

>>> # Event Argument Extraction (EAE) Task
>>> results = infer(text=text, triggers=[("assault", 113, 120), ("pounded", 121, 128)], task="EAE")
>>> print(results[0]["events"])
[
    {
        "type": "attack", "trigger": "assault", "offset": [113, 120],
        "arguments": [
            {   "mention": "U.S.", "offset": [0, 4], "role": "attacker"},
            {   "mention": "British", "offset": [9, 16], "role": "attacker"},
            {   "mention": "Saturday", "offset": [81, 89], "role": "time"}
        ]
    },
    {
        "type": "injure", "trigger": "pounded", "offset": [121, 128],
        "arguments": [
            {   "mention": "U.S.", "offset": [0, 4], "role": "attacker"},
            {   "mention": "Saturday", "offset": [81, 89], "role": "time"},
            {   "mention": "British", "offset": [9, 16], "role": "attacker"}
        ]
    }
]

Train your Own Model with OmniEvent

OmniEvent can help users easily train and evaluate their customized models on specific datasets.

We show a step-by-step example of using OmniEvent to train and evaluate an Event Detection model on the ACE-EN dataset in the Seq2Seq paradigm. More examples are shown in examples.

Step 1: Process the dataset into the unified format

We provide standard data-processing scripts for several commonly used datasets. Check out the details in scripts/data_processing. A hypothetical sketch of a resulting unified-format record follows the commands below.

dataset=ace2005-en  # the dataset name
cd scripts/data_processing/$dataset
bash run.sh
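
For intuition, a single instance in the unified format might look like the record sketched below. The exact field names are an assumption made for illustration here; see the converters in scripts/data_processing for the authoritative schema.

# Hypothetical sketch of one unified-format record; the field names are
# illustrative assumptions, not the exact output of run.sh.
example_instance = {
    "id": "doc-0001",
    "text": "U.S. and British troops were moving on the strategic "
            "southern port city of Basra Saturday",
    "events": [
        {
            "type": "transport",
            "trigger": {"word": "moving", "offset": [29, 35]},
            "arguments": [
                {"mention": "troops", "offset": [17, 23], "role": "artifact"},
            ],
        },
    ],
}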

Step 2: Set up the customized configurations

We keep track of the dataset, model, and training configurations via a single *.yaml file. See ./configs for details.

>>> from OmniEvent.arguments import DataArguments, ModelArguments, TrainingArguments, ArgumentParser
>>> from OmniEvent.input_engineering.seq2seq_processor import type_start, type_end

>>> parser = ArgumentParser((ModelArguments, DataArguments, TrainingArguments))
>>> model_args, data_args, training_args = parser.parse_yaml_file(yaml_file="config/all-datasets/ed/s2s/ace-en.yaml")

>>> training_args.output_dir = 'output/ACE2005-EN/ED/seq2seq/t5-base/'
>>> data_args.markers = ["<event>", "</event>", type_start, type_end]

Step 3: Initialize the model and tokenizer

OmniEvent supports various backbones. Users can specify the model and tokenizer in the config file and initialize them as follows.

>>> from OmniEvent.backbone.backbone import get_backbone
>>> from OmniEvent.model.model import get_model

>>> backbone, tokenizer, config = get_backbone(model_type=model_args.model_type,
                                               model_name_or_path=model_args.model_name_or_path,
                                               tokenizer_name=model_args.model_name_or_path,
                                               markers=data_args.markers,
                                               new_tokens=data_args.markers)
>>> model = get_model(model_args, backbone)

Step 4: Initialize the dataset and evaluation metric

OmniEvent prepares the DataProcessor and the corresponding evaluation metrics for different tasks and paradigms.

Note that the metrics here are paradigm-dependent and are not used for the final unified evaluation.

>>> from OmniEvent.input_engineering.seq2seq_processor import EDSeq2SeqProcessor
>>> from OmniEvent.evaluation.metric import compute_seq_F1

>>> train_dataset = EDSeq2SeqProcessor(data_args, tokenizer, data_args.train_file)
>>> eval_dataset = EDSeq2SeqProcessor(data_args, tokenizer, data_args.validation_file)
>>> metric_fn = compute_seq_F1

Step 5: Define Trainer and train

OmniEvent adopts Trainer from 🤗 Transformers for training and evaluation.

>>> from OmniEvent.trainer_seq2seq import Seq2SeqTrainer

>>> trainer = Seq2SeqTrainer(
        args=training_args,
        model=model,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=metric_fn,
        data_collator=train_dataset.collate_fn,
        tokenizer=tokenizer,
    )
>>> trainer.train()

Step 6: Unified Evaluation

Since the metrics in Step 4 depend on the paradigm, it is not fair to directly compare the performance of models in different paradigms.

OmniEvent evaluates models of different paradigms in a unified manner, where the predictions of different models are converted to predictions on the same candidate sets and then evaluated.

>>> from OmniEvent.evaluation.utils import predict, get_pred_s2s
>>> from OmniEvent.evaluation.convert_format import get_ace2005_trigger_detection_s2s

>>> logits, labels, metrics, test_dataset = predict(trainer=trainer, tokenizer=tokenizer, data_class=EDSeq2SeqProcessor,
                                                    data_args=data_args, data_file=data_args.test_file,
                                                    training_args=training_args)
>>> # paradigm-dependent metrics
>>> print("{} test performance before converting: {}".formate(test_dataset.dataset_name, metrics["test_micro_f1"]))  
ACE2005-EN test performance before converting: 66.4215686224377

>>> preds = get_pred_s2s(logits, tokenizer)
>>> # convert to the unified prediction and evaluate
>>> pred_labels = get_ace2005_trigger_detection_s2s(preds, labels, data_args.test_file, data_args, None)
ACE2005-EN test performance after converting: 67.41016109045849

For datasets whose test set annotations are not public, such as MAVEN and LEVEN, OmniEvent provides scripts to generate submission files. See dump_result.py for details.

Supported Datasets & Models & Contests

Continually updated. Contributions of more datasets and models are welcome!

Datasets

Language   Domain      Task       Dataset
English    General     ED         MAVEN
English    General     ED, EAE    ACE-EN
English    General     ED, EAE    ACE-DYGIE
English    General     ED, EAE    RichERE (KBP+ERE)
Chinese    Legal       ED         LEVEN
Chinese    General     ED, EAE    DuEE
Chinese    General     ED, EAE    ACE-ZH
Chinese    Financial   ED, EAE    FewFC

Models

  • Paradigm
    • Token Classification (TC)
    • Sequence Labeling (SL)
    • Sequence to Sequence (Seq2Seq)
    • Machine Reading Comprehension (MRC)
  • Backbone
    • CNN / LSTM
    • Transformers (BERT, T5, etc.)
  • Aggregation
    • Select [CLS]
    • Dynamic/Max Pooling
    • Marker
    • GCN
  • Head
    • Linear / CRF / MRC heads

Experiments

We implement and evaluate state-of-the-art methods on some popular benchmarks using OmniEvent, and the results are shown in our ACL 2023 paper "The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation".

Citation

If our code helps you, please cite us:

@inproceedings{peng2023devil,
  title={The Devil is in the Details: On the Pitfalls of Event Extraction Evaluation},
  author={Peng, Hao and Wang, Xiaozhi and Yao, Feng and Zeng, Kaisheng and Hou, Lei and Li, Juanzi and Liu, Zhiyuan and Shen, Weixing},
  booktitle={Findings of ACL 2023},
  year={2023}
}

More Repositories

1. Entity_Alignment_Papers: Must-read papers on entity alignment published in recent years (530 stars)
2. EvaluationPapers4ChatGPT: Resource, Evaluation and Detection Papers for ChatGPT (451 stars)
3. Knowledge_Graph_Reasoning_Papers: Must-read papers on knowledge graph reasoning (429 stars)
4. EAkit: Entity Alignment toolkit (EAkit), a lightweight, easy-to-use and highly extensible PyTorch implementation of many entity alignment algorithms (Python, 194 stars)
5. KEPLER: Source code for TACL paper "KEPLER: A Unified Model for Knowledge Embedding and Pre-trained Language Representation" (Python, 193 stars)
6. MAVEN-dataset: Source code and dataset for EMNLP 2020 paper "MAVEN: A Massive General Domain Event Detection Dataset" (Python, 149 stars)
7. MetaKGR: Source codes and datasets for EMNLP 2019 paper "Adapting Meta Knowledge Graph Information for Multi-Hop Reasoning over Few-Shot Relations" (Python, 113 stars)
8. ChatLog: ⏳ ChatLog: Recording and Analysing ChatGPT Across Time (Jupyter Notebook, 94 stars)
9. MOOCCubeX: A large-scale knowledge repository for adaptive learning, learning analytics, and knowledge discovery in MOOCs, hosted by THU KEG (Python, 84 stars)
10. CLEVE: Source code for ACL 2021 paper "CLEVE: Contrastive Pre-training for Event Extraction" (Python, 80 stars)
11. KoPL: Knowledge Oriented Programming Language (Python, 79 stars)
12. MAVEN-ERE: Source code and dataset for EMNLP 2022 paper "MAVEN-ERE: A Unified Large-scale Dataset for Event Coreference, Temporal, Causal, and Subevent Relation Extraction" (Python, 76 stars)
13. KoLA: [ICLR24] The open-source repo of THU-KEG's KoLA benchmark (Jupyter Notebook, 50 stars)
14. DacKGR: Source codes and datasets for EMNLP 2020 paper "Dynamic Anticipation and Completion for Multi-Hop Reasoning over Sparse Knowledge Graph" (Jupyter Notebook, 46 stars)
15. PKGC: Do Pre-trained Models Benefit Knowledge Graph Completion? A Reliable Evaluation and a Reasonable Approach (Python, 43 stars)
16. EDUKG: EDUKG: a Heterogeneous Sustainable K-12 Educational Knowledge Graph (Python, 39 stars)
17. KECG: Source code and datasets for EMNLP 2019 paper "Semi-supervised Entity Alignment via Joint Knowledge Embedding Model and Cross-graph Model" (Python, 38 stars)
18. ADELIE: Aligning Large Language Models on Information Extraction (Python, 30 stars)
19. MOOC-Radar: The data and source code for the paper "MoocRadar: A Fine-grained and Multi-aspect Knowledge Repository for Improving Cognitive Student Modeling in MOOCs" (Python, 30 stars)
20. CCL2022_Storyline_Relationship_Classification: CCL2022 news storyline relation recognition (29 stars)
21. TWAG: Code and dataset for the ACL 2021 paper "TWAG: A Topic-guided Wikipedia Abstract Generator" (Perl, 20 stars)
22. COPEN: The official code and dataset for EMNLP 2022 paper "COPEN: Probing Conceptual Knowledge in Pre-trained Language Models" (Python, 19 stars)
23. BIMR: Datasets and source codes for paper "Is Multi-Hop Reasoning Really Explainable? Towards Benchmarking Reasoning Interpretability" (19 stars)
24. WaterBench: [ACL2024-Main] Data and Code for WaterBench: Towards Holistic Evaluation of LLM Watermarks (Python, 17 stars)
25. Skill-Neuron: Source code for EMNLP2022 paper "Finding Skill Neurons in Pre-trained Transformers via Prompt Tuning" (Python, 16 stars)
26. SeaKR (Python, 16 stars)
27. Awesome_MOOCs: A repo listing some must-read papers on *AI-driven MOOCs* or *Intelligent Education* published in recent years, mainly contributed by the MOOC team members at Knowledge Engineering Group ([KEG](http://keg.cs.tsinghua.edu.cn/)) of Tsinghua University (15 stars)
28. ProgramTransfer: Official code and data of the ACL 2022 paper "Program Transfer for Complex Question Answering over Knowledge Bases" (Python, 14 stars)
29. Entity-Linking-Trends-and-History: Papers about the trend of Entity Linking in recent years (11 stars)
30. ProbTree: Source code for EMNLP 2023 paper "Probabilistic Tree-of-thought Reasoning for Answering Knowledge-intensive Complex Questions" (Python, 11 stars)
31. Xlore2.0: Xlore2.0 Code [BaiduExtractor, HudongExtractor, WikiExtractor, XloreData, XloreWeb] (Java, 10 stars)
32. MAVEN-Argument: Completing the Puzzle of All-in-One Event Understanding Benchmark with Event Arguments (Python, 8 stars)
33. goal (Python, 8 stars)
34. UPER: Code for the COLING22 paper "UPER: Boosting Multi-Document Summarization with an Unsupervised Prompt-based Extractor" (Python, 8 stars)
35. Event-Level-Knowledge-Editing (Python, 8 stars)
36. KoRC: Baseline for KoRC (Python, 7 stars)
37. MOOC-NER: The code and dataset of ACL'23 paper "Distantly Supervised Course Concept Extraction in MOOCs with Academic Discipline" (Python, 6 stars)
38. KB-Plugin: The accompanying code & data for the paper "KB-Plugin: A Plug-and-play Framework for Large Language Models to Induce Programs over Low-resourced Knowledge Bases" (Python, 6 stars)
39. HIF-KAT (Python, 5 stars)
40. DICE: DICE: Detecting In-distribution Data Contamination with LLM's Internal State (Python, 5 stars)
41. Awesome-KBQA (4 stars)
42. ICLEA: Code and datasets for ICLEA: Interactive Contrastive Learning for Self-supervised Entity Alignment (4 stars)
43. ijcai13data: ijcai13-dataset-content-alignment (4 stars)
44. WikiExtrator: Extractor for Wikipedia dump files (Java, 4 stars)
45. CStory: Data resource of CStory (Python, 4 stars)
46. ARTE (4 stars)
47. MAVEN-FACT (Python, 4 stars)
48. ClinicNER: ClinicNER experiments (Python, 3 stars)
49. R-Eval: [KDD24-ADS] R-Eval: A Unified Toolkit for Evaluating Domain Knowledge of Retrieval Augmented Large Language Models (Python, 3 stars)
50. VTA: Code, APIs and data for the CIKM23 paper "LittleMu: Deploying an Online Virtual Teaching Assistant via Heterogeneous Sources Integration and Chain-of-Teach Prompts" (3 stars)
51. ConstGCN (2 stars)
52. XAlias: XAlias: An Unsupervised Bilingual Entity Alias Discovery System with Multiple Sources (Python, 2 stars)
53. IR4KGC (2 stars)
54. NGS: Source code for AACL-IJCNLP 2020 paper "Neural Gibbs Sampling for Joint Event Argument Extraction" (Python, 2 stars)
55. SQC-Score (Python, 2 stars)
56. LLMAEL: LLM-Augmented Entity Linking (Python, 2 stars)
57. LLM_Reasoning_Papers: Papers on LLM Reasoning and Retrieval-Augmented LLM Reasoning (1 star)
58. SafetyNeuron: Data and code for the paper "Finding Safety Neurons in Large Language Models" (Jupyter Notebook, 1 star)
59. KNOT (1 star)