datasets-knowledge-embedding

๐Ÿ“ A collection of common datasets used in knowledge embedding

Synopsis

This project collects different datasets used in various knowledge embedding related papers. It also standardizes the format of these datasets, making it easier to use them in the evaluation of new works.

The datasets can be downloaded from the release page.
For licensing information, please refer to the original dataset license file.

If you use this collection of datasets, please consider giving the project a star ⭐ to support it.

Datasets format

Every subfolder in this repo is a single dataset.
Every folder contains the following 18 files.

| File name | Description |
| --- | --- |
| edges_as_text_{train,valid,test}.tsv | These three files contain the three splits of the dataset, with entities and relations in textual form (e.g. italy locatedin europe). |
| edges_as_text_all.tsv | The concatenation of edges_as_text_train.tsv, edges_as_text_valid.tsv, and edges_as_text_test.tsv. |
| edges_as_id_{train,valid,test}.tsv | These three files contain the three splits of the dataset, with entities and relations mapped to numerical IDs (e.g. 38 1 2). More frequent entities and relations are mapped to lower integers (e.g. the entity/relation with ID 0 is the most frequent entity/relation in the dataset). |
| edges_as_id_all.tsv | The concatenation of edges_as_id_train.tsv, edges_as_id_valid.tsv, and edges_as_id_test.tsv. |
| map_entity_id_to_text.tsv | The mapping from the numerical IDs used for entities in edges_as_id_*.tsv to the textual representation used in edges_as_text_*.tsv (e.g. 38 italy, 2 europe). |
| map_relation_id_to_text.tsv | The mapping from the numerical IDs used for relations in edges_as_id_*.tsv to the textual representation used in edges_as_text_*.tsv (e.g. 1 locatedin). |
| frequency_entities_{all,train,valid,test}.tsv | The frequency of each entity in the various splits of the dataset. |
| frequency_relations_{all,train,valid,test}.tsv | The frequency of each relation in the various splits of the dataset. |
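As an illustration of how the ID-based files relate to the mapping files, the following sketch decodes one split back into text. It assumes that every file is tab-separated with no header row, that the mapping files have the ID in the first column and the text in the second, and that edges are ordered as head, relation, tail. The function names (`read_tsv`, `load_id_map`, `decode_edges`) are hypothetical helpers, not part of this project.

```python
from pathlib import Path


def read_tsv(path):
    """Return the non-empty rows of a tab-separated file as lists of strings."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n").split("\t") for line in handle if line.strip()]


def load_id_map(path):
    """Read a mapping file into a dict keyed by integer ID.

    Assumption: column 0 is the numerical ID and column 1 is the text.
    """
    return {int(row[0]): row[1] for row in read_tsv(path)}


def decode_edges(dataset_dir, split="train"):
    """Translate the ID triples of one split back to their textual form."""
    dataset_dir = Path(dataset_dir)
    entities = load_id_map(dataset_dir / "map_entity_id_to_text.tsv")
    relations = load_id_map(dataset_dir / "map_relation_id_to_text.tsv")
    decoded = []
    # Assumption: each row is (head entity, relation, tail entity).
    for head, relation, tail in read_tsv(dataset_dir / f"edges_as_id_{split}.tsv"):
        decoded.append(
            (entities[int(head)], relations[int(relation)], entities[int(tail)])
        )
    return decoded
```

If the assumptions hold, the result of `decode_edges("COUNTRIES-S1", "train")` should contain the same triples as edges_as_text_train.tsv, which makes it a simple consistency check after downloading a dataset.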

Add a new dataset

To add a new dataset to this collection, first create three files called train.tsv, valid.tsv, and test.tsv, containing the edges of the train, validation, and test splits respectively.
The files must contain tab-separated triples of the form (head entity, relation, tail entity).
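A minimal sketch of producing such a file from in-memory triples is shown here. The helper name `write_split` is hypothetical, and rejecting fields that contain tabs or newlines is a precaution chosen for this example rather than a documented requirement of the build script.

```python
def write_split(path, triples):
    """Write (head, relation, tail) triples to a tab-separated file, one per line."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        for head, relation, tail in triples:
            fields = (head, relation, tail)
            # A tab or newline inside a field would corrupt the three-column layout.
            if any("\t" in field or "\n" in field for field in fields):
                raise ValueError(f"field contains a tab or newline: {fields!r}")
            handle.write("\t".join(fields) + "\n")


if __name__ == "__main__":
    # Toy data for illustration only.
    write_split("train.tsv", [("italy", "locatedin", "europe")])
```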

Once you have done this, process the three files with the following bash script.

bash build.sh train.tsv valid.tsv test.tsv .

The script uses the edgelist-mapper tool under the hood.
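The frequency-ordered IDs described in the dataset format section could be produced along the following lines. This is only a sketch of the idea and not the actual edgelist-mapper implementation: counting an entity once per appearance as head or tail, counting over whatever triples are passed in, and breaking ties alphabetically are all assumptions, and the names `assign_ids` and `map_triples` are hypothetical.

```python
from collections import Counter


def assign_ids(items):
    """Map each distinct item to an integer, most frequent item first.

    Assumption: ties are broken alphabetically to keep the result deterministic.
    """
    counts = Counter(items)
    ordered = sorted(counts, key=lambda item: (-counts[item], item))
    return {item: idx for idx, item in enumerate(ordered)}


def map_triples(triples):
    """Return (entity_ids, relation_ids, encoded_triples) for textual triples."""
    # Assumption: an entity is counted once for each head or tail occurrence.
    entity_ids = assign_ids([e for head, _, tail in triples for e in (head, tail)])
    relation_ids = assign_ids([relation for _, relation, _ in triples])
    encoded = [
        (entity_ids[head], relation_ids[relation], entity_ids[tail])
        for head, relation, tail in triples
    ]
    return entity_ids, relation_ids, encoded
```

With this ordering, ID 0 always denotes the most frequent entity or relation in the input, which matches the convention stated for the edges_as_id_*.tsv files.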

Datasets

The datasets are distributed in two formats, namely text-based and id-based (see the dataset format section for the difference).

COUNTRIES-S1

This dataset was introduced in On Approximate Reasoning Capabilities of Low-Rank Vector Spaces.
The link to the original dataset as released by the authors is unknown but a copy has been taken from here.

| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |
| --- | --- | --- | --- | --- | --- |
| 271 | 2 | 1159 | 1111 | 24 | 24 |

Download COUNTRIES-S1.tgz Download COUNTRIES-S1-ID.tgz
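The statistics reported for each dataset can be recomputed from the text-based split files. The sketch below assumes an extracted dataset folder containing edges_as_text_{train,valid,test}.tsv with tab-separated head, relation, tail columns, and assumes that the entity and relation counts are taken over the union of the three splits. The function name `dataset_statistics` is hypothetical.

```python
from pathlib import Path

SPLITS = ("train", "valid", "test")


def dataset_statistics(dataset_dir):
    """Count entities, relation types, and edges per split in a dataset folder."""
    dataset_dir = Path(dataset_dir)
    entities, relations, edges_per_split = set(), set(), {}
    for split in SPLITS:
        count = 0
        path = dataset_dir / f"edges_as_text_{split}.tsv"
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line:
                    continue  # skip blank lines
                head, relation, tail = line.split("\t")
                entities.update((head, tail))
                relations.add(relation)
                count += 1
        edges_per_split[split] = count
    return {
        "entities": len(entities),
        "relation_types": len(relations),
        "edges": sum(edges_per_split.values()),
        **edges_per_split,
    }
```

Comparing the returned dictionary with the table of a dataset is a quick way to confirm that an archive was downloaded and extracted completely.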

COUNTRIES-S2

This dataset was introduced in On Approximate Reasoning Capabilities of Low-Rank Vector Spaces.
The link to the original dataset as released by the authors is unknown but a copy has been taken from here.

| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |
| --- | --- | --- | --- | --- | --- |
| 271 | 2 | 1111 | 1063 | 24 | 24 |

Download COUNTRIES-S2.tgz Download COUNTRIES-S2-ID.tgz

COUNTRIES-S3

This dataset was introduced in On Approximate Reasoning Capabilities of Low-Rank Vector Spaces.
The link to the original dataset as released by the authors is unknown but a copy has been taken from here.

| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |
| --- | --- | --- | --- | --- | --- |
| 271 | 2 | 1033 | 985 | 24 | 24 |

Download COUNTRIES-S3.tgz Download COUNTRIES-S3-ID.tgz

FB15K

This dataset was introduced in Translating Embeddings for Modeling Multi-relational Data.
The original dataset as released by the authors is available here.

Entities in this dataset are represented through Freebase IDs (e.g. /m/07l450, /film/film/genre, /m/082gq). Since these are hard to read, we are considering mapping them to Wikipedia pages (e.g. The_Last_King_of_Scotland_(film), /film/film/genre, War_film).

| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |
| --- | --- | --- | --- | --- | --- |
| 14951 | 1345 | 592213 | 483142 | 50000 | 59071 |

Download FB15K.tgz Download FB15K-ID.tgz

FB15K-237

This dataset was introduced in Observed versus latent features for knowledge base and text inference.
The original dataset as released by the authors is available here.

Entities in this dataset are represented through Freebase IDs (e.g. /m/07l450, /film/film/genre, /m/082gq). Since these are hard to read, we are considering mapping them to Wikipedia pages (e.g. The_Last_King_of_Scotland_(film), /film/film/genre, War_film).

| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |
| --- | --- | --- | --- | --- | --- |
| 14541 | 237 | 310116 | 272115 | 17535 | 20466 |

Download FB15K-237.tgz Download FB15K-237-ID.tgz

KINSHIP

This dataset was introduced in Learning systems of concepts with an infinite relational model.
The original dataset as released by the authors is available here.

| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |
| --- | --- | --- | --- | --- | --- |
| 104 | 25 | 10686 | 8544 | 1068 | 1074 |

Download KINSHIP.tgz Download KINSHIP-ID.tgz

NATIONS

This dataset was introduced in Learning systems of concepts with an infinite relational model.
The original dataset as released by the authors is available here.

| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |
| --- | --- | --- | --- | --- | --- |
| 14 | 55 | 1992 | 1592 | 199 | 201 |

Download NATIONS.tgz Download NATIONS-ID.tgz

UMLS

This dataset was introduced in Learning systems of concepts with an infinite relational model.
The original dataset as released by the authors is available here.

| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |
| --- | --- | --- | --- | --- | --- |
| 135 | 46 | 6529 | 5216 | 652 | 661 |

Download UMLS.tgz Download UMLS-ID.tgz

WN18

This dataset was introduced in Translating Embeddings for Modeling Multi-relational Data.
The original dataset as released by the authors is available here.

In the original dataset, the entities are represented through WordNet offset IDs (e.g. 01257145 derivationally_related_form 07488875), but the version distributed here has the offsets mapped to WordNet synset names that can be read by the nltk library (e.g. sensual.s.02 derivationally_related_form sensuality.n.01).
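To illustrate the synset naming scheme, the sketch below splits a name such as sensual.s.02 into its lemma, part-of-speech tag, and sense number using only the standard library, and optionally resolves it through nltk. The helper names `parse_synset_name` and `lookup_synset` are hypothetical. The lookup relies on nltk's `wordnet.synset` and needs both nltk and its WordNet corpus installed; whether every name in the dataset resolves depends on the installed WordNet version.

```python
def parse_synset_name(name):
    """Split a synset name like 'sensual.s.02' into (lemma, pos, sense_number)."""
    # Split from the right, since the lemma itself may contain dots.
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, pos, int(sense)


def lookup_synset(name):
    """Resolve a synset name with nltk (requires nltk and its WordNet corpus)."""
    # Imported lazily so that parse_synset_name works without nltk installed.
    from nltk.corpus import wordnet

    return wordnet.synset(name)
```

For example, `parse_synset_name("sensuality.n.01")` yields the lemma `sensuality`, the tag `n`, and the sense number `1`, while `lookup_synset("sensuality.n.01").definition()` would return the gloss when the corpus is available.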

| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |
| --- | --- | --- | --- | --- | --- |
| 41105 | 18 | 151442 | 141442 | 5000 | 5000 |

Download WN18.tgz Download WN18-ID.tgz

WN18RR

This dataset was introduced in Convolutional 2D Knowledge Graph Embeddings.
The original dataset as released by the authors is available here.

In the original dataset, the entities are represented through WordNet offset IDs (e.g. 01257145 derivationally_related_form 07488875), but the version distributed here has the offsets mapped to WordNet synset names that can be read by the nltk library (e.g. sensual.s.02 derivationally_related_form sensuality.n.01).

| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |
| --- | --- | --- | --- | --- | --- |
| 41105 | 11 | 93003 | 86835 | 3034 | 3134 |

Download WN18RR.tgz Download WN18RR-ID.tgz

YAGO3-10

This dataset was introduced in Convolutional 2D Knowledge Graph Embeddings.
The original dataset as released by the authors is available here.

| Entities | Relation Types | Edges | Train Edges | Validation Edges | Test Edges |
| --- | --- | --- | --- | --- | --- |
| 123182 | 37 | 1089040 | 1079040 | 5000 | 5000 |

Download YAGO3-10.tgz Download YAGO3-10-ID.tgz

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the license file for details.
