A list of free data matching and record linkage software.

Data Matching software

This is a list of (Fuzzy) Data Matching software. The software in this list is FOSS (Free and open-source software).

The term data matching is used to indicate the procedure of bringing together information from two or more records that are believed to belong to the same entity. Data matching has two applications: (1) to match data across multiple datasets (linkage) and (2) to match data within a dataset (deduplication). See the Wikipedia page about data matching for more information.
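As a rough illustration of the two applications, the sketch below scores name similarity with `difflib.SequenceMatcher` from Python's standard library. The example records, the `similar` helper and the 0.85 threshold are assumptions made for this illustration; the tools in this list use more refined comparison and classification methods.

```python
from difflib import SequenceMatcher
from itertools import combinations, product


def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    """Return True when two strings are close enough to be treated as a match."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def link(left: list[str], right: list[str]) -> list[tuple[str, str]]:
    """Linkage: compare every record in one dataset with every record in another."""
    return [(a, b) for a, b in product(left, right) if similar(a, b)]


def deduplicate(records: list[str]) -> list[tuple[str, str]]:
    """Deduplication: compare every pair of records within a single dataset."""
    return [(a, b) for a, b in combinations(records, 2) if similar(a, b)]


# Hypothetical example data
dataset_a = ["Jonathan Smith", "Maria Garcia"]
dataset_b = ["Jonathon Smith", "Peter Jones"]

links = link(dataset_a, dataset_b)
duplicates = deduplicate(dataset_a + dataset_b)
```

Comparing all pairs scales quadratically, which is why dedicated tools usually restrict the candidate pairs first (blocking or indexing).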

Similar terms: record linkage, data matching, deduplication, fuzzy matching, entity resolution

Overview

The table below gives a dense overview of data matching software properties. The properties evaluated are Application Programming Interface (API), Graphical User Interface (GUI), Linking, Deduplication, Supervised Learning, Unsupervised Learning and Active Learning.

| Software | API | GUI | Link | Dedup | Supervised Learning | Unsupervised Learning | Active Learning |
|---|---|---|---|---|---|---|---|
| AtyImo | PySpark | ❌ | ✅ | ✅ | ❌ | ❌ | ❌ |
| Dedupe | Python | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ |
| dirty-cat | Python | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
| fastLink | R | ❌ | ✅ | ❔ | ❌ | ✅ | ❌ |
| FEBRL | Python | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| FRIL | Java | ✅ | ✅ | ❌ | ❔ | ✅ | ❌ |
| FuzzyMatcher | Python | ❌ | ✅ | ❌ | ❌ | ✅ | ❌ |
| hlink | PySpark | ❌ | ✅ | ❔ | ❌ | ❌ | ❌ |
| JedAI | Java | ✅ | ✅ | ❔ | ✅ | ❔ | ❔ |
| PRIL | SQL | ❌ | ✅ | ❔ | ❔ | ❔ | ❔ |
| Python Record Linkage Toolkit | Python | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
| RecordLinkage (R) | R | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Reclin2 | R | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
| RELAIS | ❌ | ✅ | ✅ | ❔ | ❔ | ✅ | ❌ |
| ReMaDDer | ❌ | ✅ | ✅ | ✅ | ❌ | ✅ | ❌ |
| RLTK | Python | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |
| Splink | Python | ❌ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Zingg | Python | ❌ | ✅ | ✅ | ✅ | ❌ | ❌ |

✅ Yes/Implemented ❌ No/Not implemented ❔ Unknown

Software

This section describes data matching software. The software is alphabetically ordered.

AtyImo

AtyImo implements a mixture of deterministic and probabilistic routines for data linkage. It was initially developed in 2013 as a linkage tool supporting a joint Brazil–U.K. project that aims to build a large population-based cohort with data from more than 100 million participants and to produce disease-specific data to facilitate diverse epidemiological research studies.

Language Python, Spark
Latest release NA

Dedupe

Dedupe is a Python library for fuzzy matching, deduplication and entity resolution on structured data. The library uses active learning to match record pairs, which is useful when no training data is available. A companion command-line tool, csvdedupe, deduplicates CSV files. Dedupeio also offers commercial products for data matching.
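The idea behind active learning can be sketched with generic uncertainty sampling: ask a human to label the candidate pair the current model is least sure about. This is an illustrative assumption and not a description of Dedupe's internal strategy; the `pick_query` helper and the scores are hypothetical.

```python
def pick_query(scores: dict) -> tuple:
    """Return the unlabeled pair whose match probability is closest to 0.5."""
    return min(scores, key=lambda pair: abs(scores[pair] - 0.5))


# Hypothetical match probabilities from a partially trained model
scores = {
    ("r1", "r2"): 0.97,  # almost certainly a match
    ("r1", "r3"): 0.52,  # model is unsure: the most informative pair to label
    ("r2", "r4"): 0.08,  # almost certainly not a match
}

query = pick_query(scores)
```

After the user labels the selected pair, the model is retrained and the step repeats until the predictions are good enough.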

Language Python

dirty-cat

dirty-cat is an open-source Python package that facilitates machine learning with dirty data and is robust to morphological variants such as typos. Currently supported features include fuzzy joining of tables on dirty numerical, string or mixed-type columns, and deduplicating and encoding dirty categorical variables for ML. One example illustrates why to use dirty-cat encoders rather than OneHotEncoder on dirty data, and another shows how to join multiple dirty tables for ML. The transformers (TableVectorizer, FeatureAugmenter) are scikit-learn compatible and can easily be introduced into ML pipelines.

Language Python, Spark

fastLink

fastLink implements a Fellegi-Sunter probabilistic record linkage model that allows for missing data and the inclusion of auxiliary information. This includes functionality to merge two datasets under the Fellegi-Sunter model using the Expectation-Maximization algorithm. fastLink is a programming API written in R. (Enamorado, Fifield & Imai, 2017) [source code]
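To make the estimation step concrete, here is a minimal Python sketch of Expectation-Maximization for a basic Fellegi-Sunter model with binary agreement patterns and conditionally independent fields. It is not fastLink's implementation (which is written in R and also handles missing data and auxiliary information); the function names, starting values, clamping constant and example data are all assumptions made for illustration.

```python
from math import prod

EPS = 1e-6  # assumed safeguard that keeps probabilities away from 0 and 1


def clamp(x: float) -> float:
    return min(max(x, EPS), 1.0 - EPS)


def likelihood(gamma, probs):
    """P(gamma | class), assuming fields agree independently within a class."""
    return prod(q if agree else 1.0 - q for agree, q in zip(gamma, probs))


def fit_em(patterns, n_iter=100):
    """Estimate p = P(match), m_k = P(agree | match), u_k = P(agree | non-match)."""
    n, n_fields = len(patterns), len(patterns[0])
    p, m, u = 0.1, [0.9] * n_fields, [0.1] * n_fields  # assumed starting values
    for _ in range(n_iter):
        # E-step: posterior probability that each record pair is a match
        g = []
        for gamma in patterns:
            pm = p * likelihood(gamma, m)
            pu = (1.0 - p) * likelihood(gamma, u)
            g.append(pm / (pm + pu))
        # M-step: re-estimate the parameters from the soft assignments
        total = sum(g)
        p = clamp(total / n)
        m = [clamp(sum(gi * x[k] for gi, x in zip(g, patterns)) / max(total, EPS))
             for k in range(n_fields)]
        u = [clamp(sum((1.0 - gi) * x[k] for gi, x in zip(g, patterns)) / max(n - total, EPS))
             for k in range(n_fields)]
    return p, m, u


# Hypothetical agreement patterns for 100 record pairs compared on three fields
patterns = (
    [(1, 1, 1)] * 18 + [(1, 1, 0)] * 2 + [(1, 0, 1)] * 2
    + [(1, 0, 0)] * 4 + [(0, 0, 1)] * 4 + [(0, 0, 0)] * 70
)
p, m, u = fit_em(patterns)
```

No labeled pairs are needed: the algorithm alternates between guessing which pairs are matches and updating the agreement probabilities, which is why the table lists this approach as unsupervised.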

Language R

FEBRL

Febrl (Freely Extensible Biomedical Record Linkage) is a training tool suitable for users to learn and experiment with record linkage techniques, as well as for practitioners to conduct linkages with data sets containing up to several hundred thousand records. Febrl is a data matching tool with a large number of algorithms implemented; it offers a Python programming interface as well as a simple GUI. Febrl doesn't offer unsupervised or active learning algorithms. The software is no longer actively maintained. (Christen, 2008) [source code]

License Custom
Language Python

FRIL

FRIL (Fine-grained Records Integration and Linkage tool) is a free tool that enables record linkage through a GUI. The tool implements automatic weight estimation through the EM algorithm and offers several techniques to make record pairs. FRIL was developed by Emory University and is no longer maintained. [source code]

License Custom
Language Java

FuzzyMatcher

A Python package that allows the user to fuzzy match two pandas dataframes based on one or more fields in common. The functionality is limited at the moment. [source code]

Language Python

hlink

A Python package designed to link two datasets. Its primary use case was linking demographic records in a Household -> Person hierarchical structure, but it can also link generic datasets by skipping the household linking tasks. It allows for probabilistic and deterministic record linkage. [source code]

Language Python

JedAI

The Java gEneric DAta Integration (JedAI) Toolkit is an entity resolution tool developed by a group of universities. JedAI offers a graphical user interface. [source code]

Language Java

PRIL

PRIL (Point-of-contact Interactive Record Linkage) is a record linkage program with a GUI. PRIL can be used to link datasets about individuals. (Rentsch CT, Kabudula CW, Catlett J et al., 2017) [source code]

Language SQLPL

Python Record Linkage Toolkit

The Python Record Linkage Toolkit is a library to link records in or between data sources. The toolkit provides most of the tools needed for record linkage and deduplication. The package is developed for research and for linking small or medium-sized files.

Language Python

RecordLinkage (R)

Package written in R that provides functions for linking and de-duplicating data sets. Both supervised and unsupervised classification algorithms are available. Record pairs can be compared with a limited set of algorithms. The package is published on CRAN.

Language R

Reclin2

Package written in R that provides functions for linking data sets. The framework offers the option to compute the weights of the Fellegi-Sunter model. It doesn't implement unsupervised algorithms to predict the cutoff. The package is published on CRAN. Formerly https://github.com/djvanderlaan/reclin.
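For reference, the Fellegi-Sunter weights are conventionally defined as below, often with base-2 logarithms. The notation is introduced here for illustration and is not taken from the Reclin2 documentation: $\gamma_k \in \{0, 1\}$ indicates whether a record pair agrees on field $k$, $m_k$ is the probability of agreement among true matches, and $u_k$ is the probability of agreement among non-matches.

```latex
w_k(\gamma_k) =
\begin{cases}
  \log_2 \dfrac{m_k}{u_k}, & \gamma_k = 1 \\[2ex]
  \log_2 \dfrac{1 - m_k}{1 - u_k}, & \gamma_k = 0
\end{cases}
\qquad
W(\gamma) = \sum_{k=1}^{K} w_k(\gamma_k)
```

A pair is classified as a match when its total weight $W(\gamma)$ exceeds a cutoff $T$; choosing $T$ is the step that, per the description above, the package leaves to the user.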

Language R

RELAIS

RELAIS (REcord Linkage At IStat) is a toolkit providing a set of techniques for dealing with record linkage projects. IStat is the main producer of official statistics in Italy.

License EUPL-1.1
Language R/Java

ReMaDDer

ReMaDDer is free, unsupervised fuzzy data matching software with a GUI. ReMaDDer is capable of performing fully automatic fuzzy record matching without human expert intervention, while attaining the accuracy of human clerical review. NOTE: The software is free, but not open source and requires an internet connection to work.

RLTK

The Record Linkage ToolKit (RLTK) is a general-purpose open-source record linkage package. The toolkit provides a full pipeline needed for record linkage and deduplication.

Language Python

Splink

Splink is a Python package for probabilistic record linkage at scale. It supports multiple backends to execute linkage jobs, including DuckDB, Apache Spark and AWS Athena. It can link and deduplicate very large datasets of tens of millions of records with runtimes of less than an hour, including the clustering of results using connected components. It includes interactive tools to support the lifecycle of a linking project, from exploratory analysis through to diagnostics and quality assurance. [source code]
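Clustering with connected components means that records joined by a chain of pairwise matches end up in the same entity. The sketch below shows the idea with a small union-find structure in plain Python; it is not Splink's implementation, and the `cluster_matches` function and the example IDs are hypothetical.

```python
def cluster_matches(record_ids, matched_pairs):
    """Group records into clusters: the connected components of the match graph."""
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = {}
    for r in record_ids:
        clusters.setdefault(find(r), []).append(r)
    return sorted(sorted(members) for members in clusters.values())


# Records 1, 2 and 3 are linked through the chain 1-2 and 2-3; record 4 has no match
clusters = cluster_matches([1, 2, 3, 4, 5, 6], [(1, 2), (2, 3), (5, 6)])
```

Because matches are treated as transitive, a single false-positive link can merge two otherwise separate entities, so the match threshold matters for cluster quality.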

Language Python

Zingg

Zingg is an open-source, ML-based tool for entity resolution with which analytics engineers and data scientists can quickly integrate data silos and build unified views at scale. Zingg can connect to disparate data sources: local and cloud file systems in any format, enterprise applications, and relational, NoSQL and cloud databases and warehouses. It scales to large volumes of data, and you can define domain-specific functions to improve matching. Zingg supports English as well as Chinese, Thai, Japanese, Hindi and other languages, and it has a very active Slack community where people from around the globe help each other and share their views.

Language Python, Spark

Outdated / no longer available

BigMatch (by USA census)

A record linkage tool, developed by the U.S. Census Bureau, for matching a very large file against a moderate-size file. Several papers about this program are available (BigMatch, 2007).

The Link King

The Link King’s graphical user interface (GUI) makes record linkage and unduplication easy for beginning and advanced users. The software requires a SAS license.

Contributing

Do you know an open-source and/or free data matching tool? Please open an issue or submit a pull request. The same holds for missing or incomplete information.

This project was initiated by the author of the Python Record Linkage Toolkit, @J535D165. The aim is to build a list and comparison of data matching software.

This list is licensed under CC-BY-SA 3.0.
