ReFinED is an efficient and accurate entity linking (EL) system.

ReFinED

Quickstart

pip install https://github.com/amazon-science/ReFinED/archive/refs/tags/V1.zip
from refined.inference.processor import Refined
refined = Refined.from_pretrained(model_name='wikipedia_model_with_numbers',
                                  entity_set="wikipedia")
spans = refined.process_text("<add_text_here>")

Overview

ReFinED is an entity linking (EL) system which links entity mentions in documents to their corresponding entities in Wikipedia or Wikidata (over 30M entities). The combination of accuracy, speed, and scalability of ReFinED means the system is capable of being deployed to extract entities from web-scale datasets with higher accuracy and an order of magnitude lower cost than existing approaches.

News

  • (November 2022)
    • Code refactoring 🔨
    • Increased inference speed by 2x (replicates results from our paper) 💨
    • Released aida_model (trained on news articles) and questions_model (trained on questions) to replicate the results from our paper ✅
    • New features 🚀
      • Entity linking evaluation code
      • Fine-tuning script (allows use of custom datasets)
      • Training script
      • Data generation script (includes adding additional entities).

Hardware Requirements

ReFinED has a low hardware requirement. For fast inference speed, a GPU should be used, but this is not a strict requirement.

Model Architecture

In summary, ReFinED uses a Transformer model to perform mention detection, entity typing, and entity disambiguation for all mentions in a document in a single forward pass. The model is trained on a dataset we generated using Wikipedia hyperlinks, which consists of over 150M entity mentions. The model uses entity descriptions and fine-grained entity types to perform linking. Therefore, new entities can be added to the system without retraining.

ReFinED Paper

The ReFinED model architecture is described in the paper below (https://arxiv.org/abs/2207.04108):

@inproceedings{ayoola-etal-2022-refined,
    title = "{R}e{F}in{ED}: An Efficient Zero-shot-capable Approach to End-to-End Entity Linking",
    author = "Tom Ayoola and Shubhi Tyagi and Joseph Fisher and Christos Christodoulopoulos and Andrea Pierleoni",
    booktitle = "NAACL",
    year = "2022"
}

Incorporating Knowledge Base Information Paper

The following paper is an extension of ReFinED which incorporates Knowledge Base (KB) information into the ED model in a fully differentiable and scalable manner (https://arxiv.org/abs/2207.04106):

@inproceedings{ayoola-etal-2022-improving,
    title = "Improving Entity Disambiguation by Reasoning over a Knowledge Base",
    author = "Tom Ayoola and Joseph Fisher and Andrea Pierleoni",
    booktitle = "NAACL",
    year = "2022"
}

Examples

While classical NER systems, such as the widely used spaCy, classify entities into high-level classes (e.g. PERSON, LOCATION, NUMBER, ...; 26 in total for spaCy), ReFinED supports over 1k low-level classes (e.g. Human, Football Team, Politician, Screenwriter, Association Football Player, Guitarist, ...). As an example, for the sentence "England qualified for the 1970 FIFA World Cup in Mexico as reigning champions.", ReFinED predicts "England" → {national football team} and "Mexico" → {country}, while spaCy maps both "England" and "Mexico" → {GPE - country}. Using fine-grained classes, the model is able to probabilistically narrow down the set of possible candidates for "England", leading to correct disambiguation of the entity. Additionally, ReFinED uses textual descriptions of entities to perform disambiguation.
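
The narrowing step described above can be pictured with a toy example. This is only an illustrative sketch: the candidate list, the type probabilities, and the rank_by_type helper are all made up for this example and are not ReFinED's actual scoring function, which also uses entity descriptions.

```python
from typing import Dict, List

# Hypothetical candidates for the mention "England", each with one fine-grained type.
CANDIDATES: List[Dict[str, str]] = [
    {"title": "England", "type": "country"},
    {"title": "England national football team", "type": "national football team"},
    {"title": "England cricket team", "type": "national cricket team"},
]


def rank_by_type(candidates: List[Dict[str, str]],
                 type_probs: Dict[str, float]) -> List[str]:
    """Order candidate titles by the probability assigned to their type."""
    scored = [(type_probs.get(c["type"], 0.0), c["title"]) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [title for _, title in scored]


# Invented type probabilities for "England" in a sentence about the World Cup.
type_probs = {
    "national football team": 0.85,
    "country": 0.10,
    "national cricket team": 0.05,
}

ranking = rank_by_type(CANDIDATES, type_probs)
print(ranking[0])
```

With a coarse label such as GPE, all three candidates would remain equally plausible; the fine-grained type distribution is what pushes the football team to the top in this toy setting.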

Library

Getting Started

The setup for ReFinED is very simple because the data files and datasets are downloaded automatically.

  1. Install the dependencies using the command below:
pip install -r requirements.txt

If the command above fails (which currently happens on a Mac), run the commands below instead:

conda create -n refined38 -y python=3.8 && conda activate refined38
conda install -c conda-forge python-lmdb -y
pip install -r requirements.txt
  2. Add the src folder to your Python path. One way to do this is by running this command:
export PYTHONPATH=$PYTHONPATH:src
  3. Now you can use ReFinED in your code as follows:
from refined.inference.processor import Refined
refined = Refined.from_pretrained(...)

Importing ReFinED as a library

To import the ReFinED model into your existing code, run the command below (the conda commands further down are only needed on a Mac):

pip install https://github.com/amazon-science/ReFinED/archive/refs/tags/V1.zip

Alternatively, if the command above does not work, try the commands below, which install some dependencies using conda.

conda create -n refined38 -y python=3.8 && conda activate refined38
conda install -c conda-forge python-lmdb -y
git clone https://github.com/amazon-science/ReFinED.git
cd ReFinED
python setup.py bdist_wheel --universal
pip install dist/ReFinED-1.0-py2.py3-none-any.whl
cd ..

Inference - performing EL with a trained model

We have released several trained models that are ready to use. See the code below or example_scripts/refined_demo.py for a working example. Inference speed can be improved by setting use_precomputed_descriptions=True, at the cost of increased disk usage.

from refined.inference.processor import Refined


refined = Refined.from_pretrained(model_name='wikipedia_model_with_numbers',
                                  entity_set="wikipedia")

spans = refined.process_text("England won the FIFA World Cup in 1966.")

print(spans)

Expected output:

[['England', Entity(wikidata_entity_id=Q47762, wikipedia_entity_title=England national football team), 'ORG'], ['FIFA World Cup', Entity(wikidata_entity_id=Q19317, wikipedia_entity_title=FIFA World Cup), 'EVENT'], ['1966', Entity(...), 'DATE']]

Note that str(span) only returns a few fields of the returned object for readability. Many other fields, such as top-k predictions and predicted fine-grained entity types, are also accessible from the returned Span.

Parameters

model_name: We provide four pretrained models:

  1. 'wikipedia_model': This is the model which matches the setup described in the paper
  2. 'wikipedia_model_with_numbers': This model extends the above model to also detect spaCy numerical data types in the mention detection layer ("DATE", "CARDINAL", "MONEY", "PERCENT", "TIME", "ORDINAL", "QUANTITY"). The detected type is available at span.coarse_type. If the coarse_type is "DATE", the date is normalised to a standard format, available at span.date. All non-numerical mentions have a coarse_type of "MENTION" and are passed through the entity disambiguation layer, which attempts to resolve them to a Wikidata entity.
  3. 'aida_model': This is the model which matches the setup described in the paper for fine-tuning the model on AIDA for entity linking. Note that this model is different to the model fine-tuned on AIDA for entity disambiguation only, which is also described in the paper.
  4. 'questions_model': This model is fine-tuned on short question text (lowercase text). The model was fine-tuned on the WebQSP EL dataset and the setup is described in our paper.
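
The coarse_type convention described for 'wikipedia_model_with_numbers' can be used to separate numerical spans from mentions that go through disambiguation. The sketch below uses a stand-in FakeSpan dataclass instead of the real span class, so it runs without downloading a model. Only coarse_type and date are attribute names taken from the description above; FakeSpan, its text field, and split_spans are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Numerical types listed above for 'wikipedia_model_with_numbers'.
NUMERIC_TYPES = {"DATE", "CARDINAL", "MONEY", "PERCENT", "TIME", "ORDINAL", "QUANTITY"}


@dataclass
class FakeSpan:
    """Hypothetical stand-in for a ReFinED span (not the library class)."""
    text: str                    # assumed attribute name
    coarse_type: str             # "MENTION" or one of NUMERIC_TYPES
    date: Optional[str] = None   # normalised date, only set when coarse_type is "DATE"


def split_spans(spans: List[FakeSpan]) -> Tuple[List[FakeSpan], List[FakeSpan]]:
    """Return (mentions, numeric) based on each span's coarse_type."""
    mentions = [s for s in spans if s.coarse_type == "MENTION"]
    numeric = [s for s in spans if s.coarse_type in NUMERIC_TYPES]
    return mentions, numeric


example = [
    FakeSpan("England", "MENTION"),
    FakeSpan("FIFA World Cup", "MENTION"),
    FakeSpan("1966", "DATE", date="1966"),  # the date format here is illustrative
]
mentions, numeric = split_spans(example)
print([s.text for s in mentions])
print([(s.text, s.date) for s in numeric])
```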

entity_set: Set to "wikidata" to resolve against all ~33M entities in Wikidata (after some filtering; requires more memory), or to "wikipedia" to resolve only against the ~6M entities which have a Wikipedia page.

data_dir (optional): The local directory where the data/model files will be downloaded to/loaded from (defaults to ~/.cache/refined/).

download_files (optional): Set to True the first time the code is run, to automatically download the data/model files from S3 to your local directory. Files will not be downloaded if they already exist but network calls will still be made to compare timestamps.

use_precomputed_descriptions (optional): Set to True to use precomputed embeddings of all descriptions of entities in the knowledge base (speeds up inference).

device (optional): The device to load the model/run inference on.
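
As a rough sketch of how these parameters fit together, the helper below validates the two required arguments against the values documented above and collects the optional ones only when they are set. The build_refined_kwargs function is hypothetical and not part of ReFinED; it deliberately avoids assuming any library defaults, and the "cpu" device string in the example is an assumption about the accepted format.

```python
from typing import Any, Dict, Optional

# Values documented in the Parameters section above.
MODEL_NAMES = {
    "wikipedia_model",
    "wikipedia_model_with_numbers",
    "aida_model",
    "questions_model",
}
ENTITY_SETS = {"wikipedia", "wikidata"}


def build_refined_kwargs(model_name: str,
                         entity_set: str,
                         data_dir: Optional[str] = None,
                         download_files: Optional[bool] = None,
                         use_precomputed_descriptions: Optional[bool] = None,
                         device: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical helper: check names early and drop unset optional arguments."""
    if model_name not in MODEL_NAMES:
        raise ValueError(f"unknown model_name: {model_name!r}")
    if entity_set not in ENTITY_SETS:
        raise ValueError(f"unknown entity_set: {entity_set!r}")
    kwargs: Dict[str, Any] = {"model_name": model_name, "entity_set": entity_set}
    optional = {
        "data_dir": data_dir,
        "download_files": download_files,
        "use_precomputed_descriptions": use_precomputed_descriptions,
        "device": device,
    }
    # Only pass what the caller set, so the library's own defaults apply otherwise.
    kwargs.update({k: v for k, v in optional.items() if v is not None})
    return kwargs


kwargs = build_refined_kwargs("aida_model", "wikipedia",
                              use_precomputed_descriptions=True, device="cpu")
print(kwargs)
# With ReFinED installed, this would be used as: Refined.from_pretrained(**kwargs)
```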

Evaluation

Entity disambiguation

We provide the script replicate_results.py which replicates the results reported in our paper.

Entity disambiguation evaluation is run using the eval_all function:

from refined.inference.processor import Refined
from refined.evaluation.evaluation import eval_all

refined = Refined.from_pretrained(model_name='wikipedia_model',
                                  entity_set="wikipedia")

results_numbers = eval_all(refined=refined, el=False)

The script will automatically download the test dataset splits to ~/.cache/refined/. Please ensure you have the permission to use each dataset for your use case as defined by their independent licenses.

Expected results:

We show the expected results from the evaluation scripts below. The numbers for "wikipedia_model" with entity set "wikipedia" most closely match the numbers in the paper (they differ marginally as we have updated to a newer version of Wikipedia). For both models, performance on Wikidata entities is slightly lower, as all entities in the datasets are linked to Wikipedia entities (so adding Wikidata entities just adds a large quantity of entities that will never appear in the gold labels).

The performance of "wikipedia_model_with_numbers" is slightly lower, which is expected as the model is also trained to identify numerical types.

model_name                    entity_set  AIDA  MSNBC  AQUAINT  ACE2004  CWEB  WIKI
wikipedia_model               wikipedia   87.4  94.5   91.9     91.4     77.7  88.7
wikipedia_model               wikidata    85.6  92.8   90.4     91.1     76.3  88.2
wikipedia_model_with_numbers  wikipedia   85.1  93.5   90.3     91.7     76.4  89.4
wikipedia_model_with_numbers  wikidata    84.9  93.6   90.0     91.2     75.8  88.9
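
To put a number on "slightly lower", the snippet below averages the per-dataset difference between the two entity sets for each model. The scores are copied from the table above; the mean_gap helper is only an illustrative name.

```python
from typing import Dict, List, Tuple

DATASETS = ["AIDA", "MSNBC", "AQUAINT", "ACE2004", "CWEB", "WIKI"]

# (model_name, entity_set) -> scores in the order of DATASETS, from the table above.
RESULTS: Dict[Tuple[str, str], List[float]] = {
    ("wikipedia_model", "wikipedia"): [87.4, 94.5, 91.9, 91.4, 77.7, 88.7],
    ("wikipedia_model", "wikidata"): [85.6, 92.8, 90.4, 91.1, 76.3, 88.2],
    ("wikipedia_model_with_numbers", "wikipedia"): [85.1, 93.5, 90.3, 91.7, 76.4, 89.4],
    ("wikipedia_model_with_numbers", "wikidata"): [84.9, 93.6, 90.0, 91.2, 75.8, 88.9],
}


def mean_gap(model_name: str) -> float:
    """Mean of (wikipedia score - wikidata score) across the six datasets."""
    wikipedia = RESULTS[(model_name, "wikipedia")]
    wikidata = RESULTS[(model_name, "wikidata")]
    gaps = [a - b for a, b in zip(wikipedia, wikidata)]
    return sum(gaps) / len(gaps)


for name in ("wikipedia_model", "wikipedia_model_with_numbers"):
    print(name, round(mean_gap(name), 2))
```

On these figures the average gap is roughly one point for wikipedia_model and well under one point for wikipedia_model_with_numbers.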

Entity linking

Entity linking evaluation is run using the eval_all function with el=True:

from refined.inference.processor import Refined
from refined.evaluation.evaluation import eval_all

refined = Refined.from_pretrained(model_name='aida_model',
                                  entity_set="wikipedia")

results_numbers = eval_all(refined=refined, el=True)

The results below differ slightly from the ones reported in our paper (which were produced by the Gerbil framework using an older version of Wikipedia). The wikipedia_model is trained on Wikipedia hyperlinks only, whereas the aida_model is fine-tuned on the AIDA training dataset (with the weights initialised from the wikipedia_model).

model_name       entity_set  AIDA  MSNBC
aida_model       wikipedia   85.0  75.1
wikipedia_model  wikipedia   78.3  73.4

We observe that most EL errors on the AIDA dataset are actually dataset annotation errors. The AIDA dataset does not provide the entity label for every mention that can be linked to Wikipedia. Instead, many mentions are incorrectly labelled as NIL mentions, meaning no corresponding Wikipedia page was found for the mention (during annotation). This means that EL model predictions for these mentions will be unfairly considered as incorrect. To measure the impact, we added an option to filter out model predictions which exactly align with NIL mentions in the dataset:

eval_all(refined=refined, el=True, filter_nil_spans=True)

With filter_nil_spans=True, our "aida_model" achieves 90.2 F1 on the AIDA dataset.
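
The effect of the NIL filter can be illustrated on toy data. The sketch below is not the library's implementation of filter_nil_spans; it is a minimal stand-alone version of the idea, where spans are (start, end) character offsets, gold NIL mentions carry None as their entity, and the offsets and the third entity ID are invented for the example.

```python
from typing import List, Optional, Tuple

Gold = Tuple[int, int, Optional[str]]  # (start, end, entity_id or None for NIL)
Pred = Tuple[int, int, str]            # (start, end, predicted entity_id)


def filter_nil_aligned(preds: List[Pred], gold: List[Gold]) -> List[Pred]:
    """Drop predictions whose span exactly matches a gold NIL mention."""
    nil_spans = {(start, end) for start, end, entity in gold if entity is None}
    return [p for p in preds if (p[0], p[1]) not in nil_spans]


def f1_score(preds: List[Pred], gold: List[Gold]) -> float:
    """Strict-match F1 against gold mentions that have an entity label."""
    gold_linked = {g for g in gold if g[2] is not None}
    pred_set = set(preds)
    true_positives = len(pred_set & gold_linked)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_linked) if gold_linked else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold = [(0, 7, "Q47762"), (16, 30, "Q19317"), (40, 46, None)]  # last one is NIL
preds = [(0, 7, "Q47762"), (16, 30, "Q19317"), (40, 46, "Q96")]  # illustrative IDs

print(f1_score(preds, gold))
print(f1_score(filter_nil_aligned(preds, gold), gold))
```

Without the filter, the prediction on the NIL-labelled span counts as a false positive and lowers precision; with the filter it is excluded from scoring.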

Inference speed

We run the model over the AIDA dataset development split using our script, benchmark_on_aida.py.

Hardware   Time taken to run EL on AIDA test dataset (231 news articles)
V100 GPU   6.5s
T4 GPU     7.4s
CPU        29.7s
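
For a quick sense of scale, the timings above can be converted to throughput and to speed-up relative to CPU. The numbers are taken from the table; the variable names are just for this example.

```python
NUM_ARTICLES = 231  # number of news articles, from the table header

# Wall-clock seconds per hardware type, from the table above.
SECONDS = {"V100 GPU": 6.5, "T4 GPU": 7.4, "CPU": 29.7}

# Articles processed per second on each hardware type.
throughput = {hw: NUM_ARTICLES / secs for hw, secs in SECONDS.items()}

# How many times faster each option is than the CPU run.
speedup_vs_cpu = {hw: SECONDS["CPU"] / secs for hw, secs in SECONDS.items()}

for hw in SECONDS:
    print(f"{hw}: {throughput[hw]:.1f} articles/s, {speedup_vs_cpu[hw]:.1f}x vs CPU")
```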

The first time the model is loaded it will take longer because the data files need to be downloaded to disk.

Fine-tuning

See FINE_TUNING.md for instructions on how to fine-tune the model on standard and custom datasets.

Training

See TRAINING.md for instructions on how to train the model on our Wikipedia hyperlinks dataset.

Generating and updating the data files and training dataset

To regenerate all data files, run the preprocess_all.py script. This script downloads the most recent Wikipedia and Wikidata dumps and generates the data files and Wikipedia training dataset. Note that ReFinED is capable of zero-shot entity linking, which means the data files (which will include recently added entities) can be updated without retraining the model.

Adding additional/custom entities

Additional entities (which are not in Wikidata) can be added to the entity set considered by ReFinED by running the preprocess_all.py script with the argument --additional_entities_file <path_to_file>. The file must be a jsonlines file where each row is the JSON string for an AdditionalEntity. Ideally, the entity types provided should be Wikidata classes such as "Q5" for human.
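
A jsonlines file of this kind could be produced along the following lines. This is a sketch under assumptions: the field names (label, aliases, description, entity_types, entity_id) are guesses at what an AdditionalEntity contains and should be checked against the class definition in the repository, and the entity itself is fictional.

```python
import json
import os
import tempfile
from typing import Dict, List


def write_jsonlines(path: str, rows: List[Dict]) -> None:
    """Write one JSON object per line (the jsonlines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row, ensure_ascii=False) + "\n")


# Field names below are assumptions, not a confirmed AdditionalEntity schema.
custom_entities = [
    {
        "label": "Jane Example",
        "aliases": ["J. Example"],
        "description": "fictional person used for illustration",
        "entity_types": ["Q5"],  # Wikidata class for human, as suggested above
        "entity_id": "CUSTOM_1",
    }
]

out_path = os.path.join(tempfile.mkdtemp(), "additional_entities.jsonl")
write_jsonlines(out_path, custom_entities)

# Read the file back to confirm it holds one JSON document per line.
with open(out_path, encoding="utf-8") as f:
    lines = f.read().splitlines()
print(len(lines), "row(s) written to", out_path)
```

The resulting path would then be supplied as the value of --additional_entities_file when running preprocess_all.py.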

Built With

  • PyTorch - PyTorch is an open source machine learning library based on the Torch library.
  • Transformers - Implementations of Transformer models.
  • Works with Python 3.8.10.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the CC-BY-NC 4.0 License.

Contact us

If you have questions, please open GitHub issues instead of sending us emails, as some of the listed email addresses are no longer active.
