Entity Disambiguation as text extraction (ACL 2022)

ExtEnD: Extractive Entity Disambiguation


This repository contains the code of ExtEnD: Extractive Entity Disambiguation, a novel approach to Entity Disambiguation (i.e., the task of linking a mention in context to its most suitable entity in a reference knowledge base) that reformulates the task as a text extraction problem. This work was accepted at ACL 2022.
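To make the formulation concrete, here is a toy, purely illustrative Python sketch (the { } mention markers and [SEP] separators are hypothetical placeholders, not the repository's actual input encoding):

# Entity Disambiguation recast as text extraction: the context (with the
# mention marked) and the candidate descriptions are packed into one input,
# and the model predicts by extracting the span of the correct candidate.
context = "{ Jordan } returned to the NBA in 1995."
candidates = [
    "Michael Jordan, American basketball player",
    "Jordan, country in Western Asia",
]
model_input = context + " [SEP] " + " [SEP] ".join(candidates)
# A trained ExtEnD-style model would extract the span
# "Michael Jordan, American basketball player" as its answer.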

If you find our paper, code or framework useful, please reference this work in your paper:

@inproceedings{barba-etal-2022-extend,
    title = "{E}xt{E}n{D}: Extractive Entity Disambiguation",
    author = "Barba, Edoardo  and
      Procopio, Luigi  and
      Navigli, Roberto",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics",
    month = may,
    year = "2022",
    address = "Online and Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
}


ExtEnD is built on top of the classy library. If you are interested in using this project, we recommend reading its introduction first, although it is not strictly required to train and use the models.

Finally, we also developed a few additional tools that make it simple to use and test ExtEnD models: a spaCy component for end-to-end entity linking and a Docker image exposing a Streamlit demo and a REST service (see the corresponding sections below).

Setup the environment

Requirements:

  • Debian-based (e.g. Debian, Ubuntu, ...) system
  • conda installed

To quickly set up the environment to use ExtEnD and replicate our experiments, run the bash script setup.sh:

bash setup.sh

Checkpoints

We release the following checkpoints:

Model             Training Dataset  Avg Score
Longformer Large  AIDA              85.8

Once you have downloaded the files, untar them inside the experiments/ folder.

# move file to experiments folder
mv ~/Downloads/extend-longformer-large.tar.gz experiments/
# untar
tar -xf experiments/extend-longformer-large.tar.gz -C experiments/
rm experiments/extend-longformer-large.tar.gz
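To verify the extraction, you can list the checkpoint files (a sketch; the nested run directories follow the pattern shown in the Evaluation section below):

from pathlib import Path

# Print all checkpoints under the extracted experiment folder, e.g.
# experiments/extend-longformer-large/<date>/<time>/checkpoints/best.ckpt
for ckpt in sorted(Path("experiments/extend-longformer-large").rglob("*.ckpt")):
    print(ckpt)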

Data

All the datasets used to train and evaluate ExtEnD can be downloaded with the download script provided in the facebook GENRE repository.

We strongly recommend organizing them in the following structure under the data folder, as this layout is assumed by several scripts in the project.

data
├── aida
│   ├── test.aida
│   ├── train.aida
│   └── validation.aida
└── out_of_domain
    ├── ace2004-test-kilt.ed
    ├── aquaint-test-kilt.ed
    ├── clueweb-test-kilt.ed
    ├── msnbc-test-kilt.ed
    └── wiki-test-kilt.ed
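If you want to double-check this layout, here is a small, hypothetical sanity check mirroring the tree above:

from pathlib import Path

# Fail fast if any of the expected dataset files is missing.
expected = [
    "data/aida/train.aida",
    "data/aida/validation.aida",
    "data/aida/test.aida",
    "data/out_of_domain/ace2004-test-kilt.ed",
    "data/out_of_domain/aquaint-test-kilt.ed",
    "data/out_of_domain/clueweb-test-kilt.ed",
    "data/out_of_domain/msnbc-test-kilt.ed",
    "data/out_of_domain/wiki-test-kilt.ed",
]
missing = [p for p in expected if not Path(p).exists()]
if missing:
    raise FileNotFoundError(f"Missing dataset files: {missing}")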

Training

To train a model from scratch, you just have to use the following command:

classy train qa <folder> -n my-model-name --profile aida-longformer-large-gam -pd extend

<folder> can be any folder containing exactly 3 files:

  • train.aida
  • validation.aida
  • test.aida

This is required to let classy automatically discover the dataset splits. For instance, to re-train our AIDA-only model:

classy train data/aida -n my-model-name --profile aida-longformer-large-gam -pd extend

Note that <folder> can be any folder, as long as:

  • it contains these 3 files
  • they are in the same format as the files in data/aida

So if you want to train on a different dataset, just create the corresponding directory and you are ready to go!
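For instance, a minimal, hypothetical way to stage such a directory (data/my_dataset is just an example name; the source files must already be in the .aida format):

from pathlib import Path
import shutil

# Create a dataset folder that classy can auto-discover: it must contain
# train.aida, validation.aida and test.aida, formatted like data/aida.
custom = Path("data/my_dataset")
custom.mkdir(parents=True, exist_ok=True)
for split in ("train", "validation", "test"):
    shutil.copy(Path("data/aida") / f"{split}.aida", custom / f"{split}.aida")

You can then train with classy train qa data/my_dataset -n my-model-name --profile aida-longformer-large-gam -pd extend.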

In case you want to modify some training hyperparameters, you just have to edit the aida-longformer-large-gam profile in the configurations/ folder. You can take a look at the modifiable parameters by adding --print to the training command. You can find more on this in the classy official documentation.
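For example, to inspect them for the profile used above, re-run the training command with --print:

classy train qa data/aida -n my-model-name --profile aida-longformer-large-gam -pd extend --print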

Predict

You can use classy syntax to perform file prediction:

classy predict -pd extend file \
    experiments/extend-longformer-large \
    data/aida/test.aida \
    -o data/aida_test_predictions.aida

Evaluation

To evaluate a checkpoint, you can run the bash script scripts/full_evaluation.sh, passing its path as an input argument. This will evaluate the model provided against both AIDA and OOD resources.

# syntax: bash scripts/full_evaluation.sh <ckpt-path>
bash scripts/full_evaluation.sh experiments/extend-longformer-large/2021-10-22/09-11-39/checkpoints/best.ckpt

If you are interested in AIDA-only evaluation, you can use scripts/aida_evaluation.sh instead (same syntax).

Furthermore, you can evaluate the model on any dataset that follows the same format as the original ones with the following command:

classy evaluate \
    experiments/extend-longformer-large/2021-10-22/09-11-39/checkpoints/best.ckpt \
    data/aida/test.aida \
    -o data/aida_test_evaluation.txt \
    -pd extend

spaCy

You can also use ExtEnD with spaCy, which gives you a seamless interface for full end-to-end entity linking. To do so, you just need to have cloned the repo and run setup.sh to configure the environment. Then, you can add extend as a custom component in the following way:

import spacy
from extend import spacy_component

nlp = spacy.load("en_core_web_sm")

extend_config = dict(
    checkpoint_path="<ckpt-path>",
    mentions_inventory_path="<inventory-path>",
    device=0,
    tokens_per_batch=4000,
)

nlp.add_pipe("extend", after="ner", config=extend_config)

input_sentence = "Japan began the defence of their title " \
                 "with a lucky 2-1 win against Syria " \
                 "in a championship match on Friday."

doc = nlp(input_sentence)

# [(Japan, Japan National Football Team), (Syria, Syria National Football Team)]
disambiguated_entities = [(ent.text, ent._.disambiguated_entity) for ent in doc.ents]

Where:

  • <ckpt-path> is the path to a pretrained checkpoint of extend that you can find in the Checkpoints section, and
  • <inventory-path> is the path to a file containing the mapping from mentions to the corresponding candidates.

We support two formats for <inventory-path>:

  • tsv:
    $ head -1 <inventory-path>
    Rome [TAB] Rome City [TAB] Rome Football Team [TAB] Roman Empire [TAB] ...
    That is, <inventory-path> is a tab-separated file where, for each row, we have the mention (Rome) followed by its possible entities.
  • sqlite: a sqlite3 database with a candidate table with two columns (a sketch of how to build one follows this list):
    • mention (text PRIMARY KEY)
    • entities (text). This must be a tab-separated list of the corresponding entities.
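A minimal, hypothetical way to build such a sqlite inventory (the table name candidates used here is an assumption; check the repository code for the exact name it expects):

import sqlite3

# Create an inventory database with the schema described above:
# mention (primary key) -> tab-separated list of candidate entities.
conn = sqlite3.connect("data/inventories/my-inventory.sqlite3")
conn.execute(
    "CREATE TABLE IF NOT EXISTS candidates (mention TEXT PRIMARY KEY, entities TEXT)"
)
conn.execute(
    "INSERT OR REPLACE INTO candidates VALUES (?, ?)",
    ("Rome", "\t".join(["Rome City", "Rome Football Team", "Roman Empire"])),
)
conn.commit()
conn.close()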

We release the following pre-computed <inventory-path> files that you could use (we recommend creating a folder data/inventories/ and placing the downloaded files inside it, e.g., data/inventories/le-and-titov-2018-inventory.min-count-2.sqlite3):

  • le-and-titov-2018-inventory.min-count-2.tsv (12,090,972 mentions): cleaned version of the candidate set released by Le and Titov (2018); we discard mentions whose count is less than 2.
  • [Recommended] le-and-titov-2018-inventory.min-count-2.sqlite3 (12,090,972 mentions): the same cleaned candidate set, as a sqlite3 database.
  • le-and-titov-2018-inventory.tsv (21,571,265 mentions): the candidate set released by Le and Titov (2018).
  • le-and-titov-2018-inventory.sqlite3 (21,571,265 mentions): the candidate set released by Le and Titov (2018), as a sqlite3 database.

Note that, as long as you respect either of these two formats, you can also create and use your own inventory!

Docker container

Finally, we also release a docker image running two services: a Streamlit demo and a REST service:

$ docker run -p 22001:22001 -p 22002:22002 --rm -itd poccio/extend:1.0.1
<container id>

Now you can:

  • check out the Streamlit demo at http://127.0.0.1:22001/
  • invoke the REST service running at http://127.0.0.1:22002/ (the OpenAPI documentation is available at http://127.0.0.1:22002/docs):
    $ curl -X POST http://127.0.0.1:22002/ -H 'Content-Type: application/json' -d '[{"text": "Rome is in Italy"}]'
    [{"text":"Rome is in Italy","disambiguated_entities":[{"char_start":0,"char_end":4,"mention":"Rome","entity":"Rome"},{"char_start":11,"char_end":16,"mention":"Italy","entity":"Italy"}]}]

Acknowledgments

The authors gratefully acknowledge the support of the ERC Consolidator Grant MOUSSE No. 726487 under the European Union’s Horizon 2020 research and innovation programme.

This work was supported in part by the MIUR under grant “Dipartimenti di eccellenza 2018-2022” of the Department of Computer Science of the Sapienza University of Rome.

License

This work is under the Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0) license.
