• Stars: 233
• Rank: 172,198 (Top 4%)
• Language: Python
• License: Apache License 2.0
• Created: over 4 years ago
• Updated: 9 months ago

Repository Details

Code for the paper "Deep Entity Matching with Pre-trained Language Models"

Ditto: Deep Entity Matching with Pre-Trained Language Models

Update: a new light-weight version of Ditto, based on newer versions of Transformers, is now available.

Ditto is an entity matching (EM) solution based on pre-trained language models such as BERT. Given a pair of data entries, EM checks if the two entries refer to the same real-world entities (products, businesses, publications, persons, etc.). Ditto leverages the powerful language understanding capability of pre-trained language models (LMs) via fine-tuning. Ditto serializes each data entry into a text sequence and casts EM as a sequence-pair classification problem solvable by LM fine-tuning. We also employ a set of novel optimizations including summarization, injecting domain-specific knowledge, and data augmentation to further boost the performance of the matching models.

For more technical details, see the Deep Entity Matching with Pre-Trained Language Models paper.

Requirements

  • Python 3.7.7
  • PyTorch 1.9
  • HuggingFace Transformers 4.9.2
  • spaCy with the en_core_web_lg model
  • NVIDIA Apex (fp16 training)

Install required packages

conda install -c conda-forge nvidia-apex
pip install -r requirements.txt
python -m spacy download en_core_web_lg

The EM pipeline

A typical EM pipeline consists of two phases: blocking and matching.

(Figure: the EM pipeline of Ditto.)

The blocking phase typically consists of simple heuristics that reduce the number of candidate pairs on which pairwise comparisons are performed. Ditto optimizes the matching phase, which performs the actual pairwise comparisons. The input to Ditto consists of a set of labeled candidate data entry pairs. Each data entry is pre-serialized into the following format:

COL title VAL microsoft visio standard 2007 version upgrade COL manufacturer VAL microsoft COL price VAL 129.95

where COL and VAL are special tokens indicating the start of attribute names and attribute values, respectively. A complete labeled pair has the format

<entry_1> \t <entry_2> \t <label>

where the two entries are serialized and <label> is either 0 (no-match) or 1 (match). In our experiments, we evaluated Ditto using two benchmarks:

  • the ER_Magellan benchmarks used in the DeepMatcher paper. This benchmark contains 13 datasets in 3 categories (Structured, Dirty, and Textual) representing different dataset characteristics.
  • the WDC product matching benchmark. This benchmark contains e-commerce product offering pairs from 4 domains: cameras, computers, shoes, and watches. The training data of each domain is also sub-sampled into different sizes (small, medium, large, and xlarge) to test the label efficiency of the models.

We provide the serialized version of their datasets in data/. The dataset configurations can be found in configs.json.
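
As a concrete illustration of the serialization and labeled-pair format above, here is a minimal sketch (not code from this repository; the serialize helper is hypothetical):

def serialize(entry: dict) -> str:
    # Turn an attribute -> value mapping into "COL <attr> VAL <value> ..."
    return " ".join(f"COL {attr} VAL {val}" for attr, val in entry.items())

left = {"title": "microsoft visio standard 2007 version upgrade",
        "manufacturer": "microsoft", "price": "129.95"}
right = {"title": "visio std 2007 version upgrade",
         "manufacturer": "microsoft", "price": "129.95"}
label = 1  # 1 = match, 0 = no-match

print(f"{serialize(left)}\t{serialize(right)}\t{label}")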

Training with Ditto

To train the matching model with Ditto:

CUDA_VISIBLE_DEVICES=0 python train_ditto.py \
  --task Structured/Beer \
  --batch_size 64 \
  --max_len 64 \
  --lr 3e-5 \
  --n_epochs 40 \
  --lm distilbert \
  --fp16 \
  --da del \
  --dk product \
  --summarize

The meaning of the flags:

  • --task: the name of the task (see configs.json; a lookup sketch follows this list)
  • --batch_size, --max_len, --lr, --n_epochs: the batch size, max sequence length, learning rate, and the number of epochs
  • --lm: the language model. We now support bert, distilbert, and albert (distilbert by default).
  • --fp16: whether to train with the half-precision floating point optimization
  • --da, --dk, --summarize: the 3 optimizations of Ditto. See the following sections for details.
  • --save_model: if this flag is on, then save the checkpoint to {logdir}/{task}/model.pt.
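
The --task value is resolved against configs.json. A minimal lookup sketch follows; any fields other than "name" are assumptions, so check configs.json for the exact schema:

import json

# Load all task configurations and index them by name.
with open("configs.json") as f:
    configs = {conf["name"]: conf for conf in json.load(f)}

task_config = configs["Structured/Beer"]
print(task_config)  # e.g. paths to the train/valid/test splits for this task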

Data augmentation (DA)

If the --da flag is set, then Ditto will train the matching model with MixDA, a data augmentation technique for text data. To use data augmentation, one transformation operator needs to be specified. We currently support the following operators for EM (a minimal sketch of the del operator follows the table):

Operator      Details
del           Delete a span of tokens
swap          Shuffle a span of tokens
drop_col      Delete a whole attribute
append_col    Move an attribute (append it to the end of another attribute)
all           Apply all the operators uniformly at random
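
Here is a minimal, illustrative sketch of the del operator on a serialized entry. This is not the repository's MixDA implementation, only the idea:

import random

def augment_del(serialized: str, max_span: int = 4) -> str:
    # Delete a random span of up to max_span tokens from a serialized entry.
    tokens = serialized.split()
    if len(tokens) <= max_span:
        return serialized
    span = random.randint(1, max_span)
    start = random.randint(0, len(tokens) - span)
    return " ".join(tokens[:start] + tokens[start + span:])

print(augment_del("COL title VAL microsoft visio standard 2007 COL price VAL 129.95"))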

Domain Knowledge (DK)

Inject domain knowledge into the input sequences if the --dk flag is set. Ditto will preprocess the serialized entries by

  • tagging informative spans (e.g., product IDs, person names) by inserting special tokens (e.g., ID, PERSON)
  • normalizing certain spans (e.g., numbers)

We currently support two injection modes: --dk general and --dk product, for the general domain and the product domain respectively. See ditto/knowledge.py for more details.
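
As an illustration of span tagging, the sketch below uses spaCy's named entity recognizer to insert a PERSON token before recognized person names. It is a simplified stand-in for ditto/knowledge.py, not the actual implementation:

import spacy

nlp = spacy.load("en_core_web_lg")

def tag_person_spans(serialized: str) -> str:
    # Insert a PERSON token before each person name spaCy recognizes.
    # Entities are processed right-to-left so character offsets stay valid.
    doc = nlp(serialized)
    out = serialized
    for ent in reversed(doc.ents):
        if ent.label_ == "PERSON":
            out = out[:ent.start_char] + "PERSON " + out[ent.start_char:]
    return out

print(tag_person_spans("COL author VAL Alan Turing COL year VAL 1950"))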

Summarization

When the --summarize flag is set, the input sequence will be summarized by retaining only the high TF-IDF tokens. The resulting sequence will be of length no more than the max sequence length (i.e., --max_len). See ditto/summarize.py for more details.
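
The sketch below illustrates the idea with a toy IDF table; it is not the code in ditto/summarize.py:

from collections import Counter

def summarize(serialized: str, idf: dict, max_len: int = 64) -> str:
    # Keep only the tokens with the highest TF-IDF scores, preserving their
    # original order, and truncate the result to at most max_len tokens.
    tokens = serialized.split()
    tf = Counter(tokens)
    ranked = sorted(set(tokens), key=lambda t: tf[t] * idf.get(t, 0.0), reverse=True)
    keep = set(ranked[:max_len])
    return " ".join([t for t in tokens if t in keep][:max_len])

idf = {"visio": 3.2, "standard": 2.0, "microsoft": 1.1, "title": 0.1}  # toy IDF values
print(summarize("COL title VAL microsoft visio standard 2007 version upgrade", idf, max_len=5))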

To run the matching models

Use the command:

CUDA_VISIBLE_DEVICES=0 python matcher.py \
  --task wdc_all_small \
  --input_path input/input_small.jsonl \
  --output_path output/output_small.jsonl \
  --lm distilbert \
  --max_len 64 \
  --use_gpu \
  --fp16 \
  --checkpoint_path checkpoints/

where --task is the task name, --input_path is the input file of candidate pairs in the jsonlines format, --output_path is the output path, and --checkpoint_path is the path to the model checkpoint (the same as --logdir used during training). The language model (--lm) and --max_len should be set to the same values used in training. The same --dk and --summarize flags also need to be specified if they were used at training time.
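
For example, candidate pairs could be written to a jsonlines file as below. This is only a sketch; the exact per-line schema expected by matcher.py is an assumption here, so consult input/input_small.jsonl for the authoritative format:

import json

# Each line pairs two serialized entries (schema assumed; check matcher.py).
pairs = [
    ("COL title VAL dell xps 13 COL price VAL 999",
     "COL title VAL dell xps 13 9310 COL price VAL 999.00"),
]

with open("input/my_candidates.jsonl", "w") as f:
    for left, right in pairs:
        f.write(json.dumps([left, right]) + "\n")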

Colab notebook

You can also run training and prediction using this Colab notebook.

More Repositories

1. ginza (Python, 727 stars): A Japanese NLP Library using spaCy as framework based on Universal Dependencies
2. HappyDB (354 stars): A corpus of 100,000 happy moments
3. bunkai (Python, 177 stars): Sentence boundary disambiguation tool for Japanese texts (日本語文境界判定器)
4. sato (Python, 107 stars): Code and data for Sato https://arxiv.org/abs/1911.06311
5. jrte-corpus (Python, 75 stars): Japanese Realistic Textual Entailment Corpus (NLP 2020, LREC 2020)
6. opiniondigest (Python, 56 stars): OpinionDigest: A Simple Framework for Opinion Summarization (ACL 2020)
7. vecscan (Python, 49 stars)
8. SubjQA (40 stars): A question-answering dataset with a focus on subjective information
9. t5-japanese (Python, 39 stars): Codes to pre-train Japanese T5 models
10. ruler (Jupyter Notebook, 36 stars): Data Programming by Demonstration (DPBD) for Document Classification
11. tagruler (JavaScript, 35 stars): Data programming by demonstration for information extraction and span annotation
12. coop (Python, 31 stars): ☘️ Code for Convex Aggregation for Opinion Summarization (Iso et al.; Findings of EMNLP 2021)
13. doduo (Python, 25 stars): Annotating Columns with Pre-trained Language Models
14. asdc (Python, 23 stars): Accommodation Search Dialog Corpus (宿泊施設探索対話コーパス)
15. instruction_ja (Python, 21 stars): Japanese instruction data (日本語指示データ)
16. rotom (Roff, 21 stars): Code for the paper "Rotom: A Meta-Learned Data Augmentation Framework for Entity Matching, Data Cleaning, Text Classification, and Beyond"
17. cocosum (Python, 20 stars): 🥥 Code & Data for Comparative Opinion Summarization via Collaborative Decoding (Iso et al.; Findings of ACL 2022)
18. ebe-dataset (PLSQL, 17 stars): Evidence-based Explanation Dataset (AACL-IJCNLP 2020)
19. ginza-transformers (Python, 17 stars): Use custom tokenizers in spacy-transformers
20. teddy (Python, 15 stars): Code and data for Teddy https://arxiv.org/abs/2001.05171
21. zett (Python, 15 stars): 🙈 Code for Zero-shot Triplet Extraction by Template Infilling (Kim et al.; IJCNLP-AACL 2023)
22. machamp (14 stars): The dataset for the paper "Machamp: A Generalized Entity Matching Benchmark" published in CIKM 2021
23. starmie (Python, 14 stars): Resources for PVLDB 2023 submission
24. meganno-client (Python, 7 stars)
25. sudowoodo (Jupyter Notebook, 7 stars): The source code of the Sudowoodo paper in ICDE 2023
26. explainit (Python, 5 stars)
27. desuwa (Emacs Lisp, 5 stars): Feature annotator to morphemes and phrases based on KNP rule files (pure-Python)
28. react-jupyter-cookiecutter (Python, 5 stars)
29. xatu (Python, 4 stars): 🕊️ Code and Data for XATU: A Fine-grained Instruction-based Benchmark for Explainable Text Updates (Zhang et al.; LREC-COLING 2024)
30. magneton (TypeScript, 4 stars): Repository of the Magneton framework for authoring interaction-aware and customizable widgets
31. emu (4 stars): Enhancing Multilingual Sentence Embeddings with Semantic Specialization (AAAI '20)
32. learnit (Python, 4 stars): A Tool for Machine Learning Beginners
33. leam (Jupyter Notebook, 3 stars): Source code and demo for Leam
34. minun (Python, 3 stars): Evaluating Counterfactual Explanations for Entity Matching
35. llm-longeval (Python, 3 stars): 💵 Code for Less is More for Long Document Summary Evaluation by LLMs (Wu, Iso et al.; EACL 2024)
36. jrte-corpus_example (Python, 3 stars): Example codes for Japanese Realistic Textual Entailment Corpus
37. Tyrogue (Jupyter Notebook, 2 stars)
38. qa-summarization (Python, 2 stars): Ting-Yao's intern project
39. pilota (Python, 1 star): ✈ SCUD generator (解釈文生成器)
40. quasi_japanese_reviews (Python, 1 star): Quasi Japanese Reviews (擬似レビューデータ)
41. MCR (1 star)
42. witqa (1 star)