A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs, VLDB 2020

A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs


Entity alignment seeks to find entities in different knowledge graphs (KGs) that refer to the same real-world object. Recent advancement in KG embedding impels the advent of embedding-based entity alignment, which encodes entities in a continuous embedding space and measures entity similarities based on the learned embeddings. In this paper, we conduct a comprehensive experimental study of this emerging field. This study surveys 23 recent embedding-based entity alignment approaches and categorizes them based on their techniques and characteristics. We further observe that current approaches use different datasets in evaluation, and the degree distributions of entities in these datasets are inconsistent with real KGs. Hence, we propose a new KG sampling algorithm, with which we generate a set of dedicated benchmark datasets with various heterogeneity and distributions for a realistic evaluation. This study also produces an open-source library, which includes 12 representative embedding-based entity alignment approaches. We extensively evaluate these approaches on the generated datasets, to understand their strengths and limitations. Additionally, for several directions that have not been explored in current approaches, we perform exploratory experiments and report our preliminary findings for future studies. The benchmark datasets, open-source library and experimental results are all accessible online and will be duly maintained.

*** UPDATE ***

  • Aug. 1, 2021: We release the source code for entity alignment with dangling cases.

  • June 29, 2021: We release the DBP2.0 dataset for entity alignment with dangling cases.

  • Jan. 8, 2021: The results of AliNet on OpenEA datasets are available on Google Docs.

  • Nov. 30, 2020: We release a new version (v2.0) of the OpenEA dataset, where the URIs of DBpedia and YAGO entities are encoded to resolve the name bias issue. We strongly recommend using the v2.0 dataset for evaluating attribute-based entity alignment methods, so that the results better reflect the robustness of these methods in real-world situations.

  • Sep. 24, 2020: We add AliNet.

Table of contents

  1. Library for Embedding-based Entity Alignment
    1. Overview
    2. Getting Started
      1. Code Package Description
      2. Dependencies
      3. Installation
      4. Usage
  2. KG Sampling Method and Datasets
    1. Iterative Degree-based Sampling
    2. Dataset Overview
    3. Dataset Description
  3. Experiment and Results
    1. Experiment Settings
    2. Detailed Results
  4. License
  5. Citation

Library for Embedding-based Entity Alignment

Overview

We developed OpenEA, an open-source library for embedding-based entity alignment, using Python and TensorFlow. The software architecture is illustrated in the following figure.

The design goals and features of OpenEA include three aspects, i.e., loose coupling, functionality and extensibility, and off-the-shelf solutions.

Getting Started

These instructions cover how to get a copy of the library and how to install and run it on your local machine for development and testing purposes. They also provide an overview of the package structure of the source code.

Package Description

src/
├── openea/
│   ├── approaches/: package of the implementations for existing embedding-based entity alignment approaches
│   ├── models/: package of the implementations for unexplored relationship embedding models
│   ├── modules/: package of the implementations for the framework of embedding module, alignment module, and their interaction
│   ├── expriment/: package of the implementations for evaluation methods

Dependencies

  • Python 3.x (tested on Python 3.6)
  • Tensorflow 1.x (tested on Tensorflow 1.8 and 1.12)
  • Scipy
  • Numpy
  • Graph-tool or igraph or NetworkX
  • Pandas
  • Scikit-learn
  • Matching==0.1.1
  • Gensim

Installation

We recommend creating a new conda environment to install and run OpenEA. You should first install tensorflow-gpu (tested on 1.8 and 1.12), graph-tool (tested on 2.27 and 2.29, the latest version would cause a bug), and python-igraph using conda:

conda create -n openea python=3.6
conda activate openea
conda install tensorflow-gpu==1.12
conda install -c conda-forge graph-tool==2.29
conda install -c conda-forge python-igraph

Then, OpenEA can be installed using pip with the following steps:

git clone https://github.com/nju-websoft/OpenEA.git OpenEA
cd OpenEA
pip install -e .

Usage

The following example shows how to use OpenEA in Python. We assume that you have already downloaded our datasets and configured the hyperparameters as in the examples.

import openea as oa

model = oa.kge_model.TransE
args = load_args("hyperparameter file folder")
kgs = read_kgs_from_folder("data folder")
model.set_args(args)
model.set_kgs(kgs)
model.init()
model.run()
model.test()
model.save()

More examples are available here

To run the off-the-shelf approaches on our datasets and reproduce our experiments, change into the ./run/ directory and use the following script:

python main_from_args.py "predefined_arguments" "dataset_name" "split"

For example, if you want to run BootEA on D-W-15K (V1) using the first split, please execute the following script:

python main_from_args.py ./args/bootea_args_15K.json D_W_15K_V1 721_5fold/1/
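To cover all five folds rather than only the first, the same command can be repeated with a different split argument each time. The sketch below only assembles and launches the command shown above; the helper names `fold_commands` and `run_all_folds` are hypothetical and not part of OpenEA, and the script is assumed to be started from inside the ./run/ directory.

```python
import subprocess
import sys


def fold_commands(args_file, dataset, folds=range(1, 6)):
    """Build one main_from_args.py command line per fold (hypothetical helper)."""
    return [
        [sys.executable, "main_from_args.py", args_file, dataset, f"721_5fold/{k}/"]
        for k in folds
    ]


def run_all_folds(args_file, dataset):
    """Run the folds one after another and stop at the first failure."""
    for cmd in fold_commands(args_file, dataset):
        subprocess.run(cmd, check=True)
```

For instance, `run_all_folds("./args/bootea_args_15K.json", "D_W_15K_V1")` would run BootEA on D-W-15K (V1) for folds 1 to 5 in sequence.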

KG Sampling Method and Datasets

As the current widely-used datasets are quite different from real-world KGs, we present a new dataset sampling algorithm to generate a benchmark dataset for embedding-based entity alignment.

Iterative Degree-based Sampling

The proposed iterative degree-based sampling (IDS) algorithm simultaneously deletes entities in two source KGs with reference alignment until the desired size is reached, while keeping the degree distribution of the sampled dataset similar to that of the source KG. The following figure describes the sampling procedure.
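The core idea can be illustrated with a small Python sketch. This is a simplified illustration and not the OpenEA implementation; the actual procedure may differ in how it scores and selects entities. It assumes each KG is given as an undirected adjacency dict of sets and that every entity belongs to exactly one aligned pair. The names `ids_sample`, `degree_distribution`, `remove_entity`, and the `batch` parameter are hypothetical. In each round, aligned pairs whose degrees are currently over-represented relative to the source distribution are more likely to be deleted.

```python
import random
from collections import Counter


def degree_distribution(adj):
    """Map each degree to the fraction of entities that have it."""
    counts = Counter(len(nbrs) for nbrs in adj.values())
    return {deg: c / len(adj) for deg, c in counts.items()}


def remove_entity(adj, ent):
    """Delete an entity and every edge attached to it."""
    for nbr in adj.pop(ent):
        if nbr in adj:
            adj[nbr].discard(ent)


def ids_sample(adj1, adj2, alignment, target_size, batch=10, seed=0):
    """Shrink two aligned KGs to target_size aligned pairs (simplified sketch)."""
    rng = random.Random(seed)
    adj1 = {e: set(n) for e, n in adj1.items()}  # work on copies
    adj2 = {e: set(n) for e, n in adj2.items()}
    pairs = list(alignment)
    # Degree distributions of the source KGs serve as the reference.
    ref1, ref2 = degree_distribution(adj1), degree_distribution(adj2)

    while len(pairs) > target_size:
        cur1, cur2 = degree_distribution(adj1), degree_distribution(adj2)
        weights = []
        for a, b in pairs:
            d1, d2 = len(adj1[a]), len(adj2[b])
            # Surplus: how much this degree is over-represented right now.
            surplus = max(cur1[d1] - ref1.get(d1, 0.0), 0.0)
            surplus += max(cur2[d2] - ref2.get(d2, 0.0), 0.0)
            weights.append(surplus + 1e-6)  # keep every weight positive
        k = min(batch, len(pairs) - target_size)
        doomed = set(rng.choices(range(len(pairs)), weights=weights, k=k))
        for i in doomed:
            # Delete both sides of the pair so the alignment stays one-to-one.
            remove_entity(adj1, pairs[i][0])
            remove_entity(adj2, pairs[i][1])
        pairs = [p for i, p in enumerate(pairs) if i not in doomed]
    return adj1, adj2, pairs
```

Deleting in small batches and recomputing the distributions after each batch is what makes the procedure iterative: removing an entity also lowers the degrees of its neighbours, so the surplus has to be re-estimated as the graphs shrink.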

Dataset Overview

We choose three well-known KGs as our sources: DBpedia (2016-10), Wikidata (20160801) and YAGO3. We also consider two cross-lingual versions of DBpedia: English-French and English-German. Following the conventions in JAPE and BootEA, we use the IDS algorithm to generate datasets of two sizes, with 15K and 100K entities:

# Entities   Languages       Dataset names
15K          Cross-lingual   EN-FR-15K, EN-DE-15K
15K          English         D-W-15K, D-Y-15K
100K         Cross-lingual   EN-FR-100K, EN-DE-100K
100K         English         D-W-100K, D-Y-100K

The v1.1 datasets used in this paper can be downloaded from figshare, Dropbox or Baidu Wangpan (password: 9feb). Note that we fixed a minor format issue in the YAGO part of the v1.0 datasets, so please download the v1.1 datasets from the above links and use this version for evaluation.

(Recommended) The v2.0 datasets can be downloaded from figshare, Dropbox or Baidu Wangpan (password: nub1).

Dataset Statistics

We generate two versions of datasets for each pair of KGs to be aligned. V1 is generated by directly using the IDS algorithm. For V2, we first randomly delete entities with low degrees (d <= 5) in the source KG until the average degree doubles, and then run IDS on the resulting KG. The statistics of the datasets are shown below.

Dataset Description

We take the EN_FR_15K_V1 dataset as an example to introduce the files in each dataset. In the 721_5fold folder, we divide the reference entity alignment into five disjoint folds, each of which accounts for 20% of the total alignment. For each fold, we use this fold (20%) as training data and split the remaining 80% into validation (10%) and testing (70%). The directory structure of each dataset is as follows:

EN_FR_15K_V1/
├── attr_triples_1: attribute triples in KG1
├── attr_triples_2: attribute triples in KG2
├── rel_triples_1: relation triples in KG1
├── rel_triples_2: relation triples in KG2
├── ent_links: entity alignment between KG1 and KG2
├── 721_5fold/: entity alignment with test/train/valid (7:2:1) splits
│   ├── 1/: the first fold
│   │   ├── test_links
│   │   ├── train_links
│   │   └── valid_links
│   ├── 2/
│   ├── 3/
│   ├── 4/
│   ├── 5/
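A dataset folder laid out as above can be read with a few lines of Python. The sketch below is independent of the OpenEA loaders and assumes that every file is plain UTF-8 text with one tab-separated record per line; that format is an assumption and should be checked against the downloaded files. The function names `read_tsv` and `load_fold` are hypothetical.

```python
from pathlib import Path


def read_tsv(path):
    """Read a tab-separated file into a list of tuples, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [tuple(line.rstrip("\n").split("\t")) for line in f if line.strip()]


def load_fold(dataset_dir, fold="1"):
    """Load the triples, the full alignment, and one train/valid/test split."""
    root = Path(dataset_dir)
    split = root / "721_5fold" / fold
    top_level = ["rel_triples_1", "rel_triples_2",
                 "attr_triples_1", "attr_triples_2", "ent_links"]
    data = {name: read_tsv(root / name) for name in top_level}
    for name in ["train_links", "valid_links", "test_links"]:
        data[name] = read_tsv(split / name)
    return data
```

For example, `load_fold("EN_FR_15K_V1", fold="1")["train_links"]` would return the training pairs of the first fold as a list of tuples.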

Experiment and Results

Experiment Settings

The common hyper-parameters used for OpenEA are shown below.

                              15K      100K
Batch size for rel. triples   5,000    20,000
Termination condition         Early stop when the Hits@1 score begins to drop on the validation sets, checked every 10 epochs.
Max. epochs                   2,000

Following common practice, we split each dataset into training, validation and test sets. The details are shown below.

# Ref. alignment   # Training   # Validation   # Test
15K                3,000        1,500          10,500
100K               20,000       10,000         70,000
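The split sizes above follow from the 2:1:7 scheme: each of the five folds supplies 20% of the reference alignment for training, and the remaining 80% is divided into validation (10% of the total) and test (70% of the total). A minimal sketch of such a split is shown below. It is not the script used to produce the released folds, and the function name `five_fold_721` and the `seed` parameter are hypothetical.

```python
import random


def five_fold_721(links, seed=0):
    """Return five (train, valid, test) splits with a 2:1:7 ratio."""
    links = list(links)
    random.Random(seed).shuffle(links)
    n = len(links)
    fold_size = n // 5   # 20% of the alignment per training fold
    n_valid = n // 10    # 10% of the alignment for validation
    folds = []
    for k in range(5):
        train = links[k * fold_size:(k + 1) * fold_size]
        rest = links[:k * fold_size] + links[(k + 1) * fold_size:]
        valid, test = rest[:n_valid], rest[n_valid:]
        folds.append((train, valid, test))
    return folds
```

With 15,000 aligned pairs this yields 3,000 training, 1,500 validation and 10,500 test pairs per fold, and the five training folds are pairwise disjoint.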

We use Hits@m (m = 1, 5, 10, 50), mean rank (MR) and mean reciprocal rank (MRR) as the evaluation metrics. Higher Hits@m and MRR scores as well as lower MR scores indicate better performance.
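These metrics can be computed directly from a similarity matrix. The sketch below is not the OpenEA evaluation code. It assumes that `sim[i][j]` holds the similarity between the i-th test entity of KG1 and the j-th candidate of KG2, and that the correct counterpart of entity i is candidate i. The function name `alignment_metrics` is hypothetical.

```python
def alignment_metrics(sim, ms=(1, 5, 10, 50)):
    """Compute Hits@m, mean rank (MR) and mean reciprocal rank (MRR)."""
    ranks = []
    for i, row in enumerate(sim):
        true_score = row[i]
        # Rank = 1 + number of candidates scored strictly above the true match.
        ranks.append(1 + sum(1 for s in row if s > true_score))
    n = len(ranks)
    hits = {m: sum(r <= m for r in ranks) / n for m in ms}
    mr = sum(ranks) / n
    mrr = sum(1.0 / r for r in ranks) / n
    return hits, mr, mrr
```

Ties are resolved in favour of the true match here, which is one of several possible conventions; a stricter convention would count equal scores against it.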

Detailed Results

The detailed and supplementary experimental results are listed as follows:

Detailed results of current approaches on the 15K datasets

detailed_results_current_approaches_15K.csv

Detailed results of current approaches on the 100K datasets

detailed_results_current_approaches_100K.csv

Running time (sec.) of current approaches

running_time.csv

Unexplored KG Embedding Models

Detailed results of unexplored KG embedding models on the 15K datasets

detailed_results_unexplored_models_15K.csv

Detailed results of unexplored KG embedding models on the 100K datasets

detailed_results_unexplored_models_100K.csv

License

This project is licensed under the GPL License. See the LICENSE file for details.

Citation

If you find the benchmark datasets, the OpenEA library or the experimental results useful, please kindly cite the following paper:

@article{OpenEA,
  author    = {Zequn Sun and
               Qingheng Zhang and
               Wei Hu and
               Chengming Wang and
               Muhao Chen and
               Farahnaz Akrami and
               Chengkai Li},
  title     = {A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs},
  journal   = {Proceedings of the VLDB Endowment},
  volume    = {13},
  number    = {11},
  pages     = {2326--2340},
  year      = {2020},
  url       = {http://www.vldb.org/pvldb/vol13/p2326-sun.pdf}
}

If you use the DBP2.0 dataset, please kindly cite the following paper:

@inproceedings{DBP2,
  author    = {Zequn Sun and
               Muhao Chen and
               Wei Hu},
  title     = {Knowing the No-match: Entity Alignment with Dangling Cases},
  booktitle = {ACL},
  year      = {2021}
}
