  • Stars: 1,251
  • Rank 37,296 (Top 0.8 %)
  • Language: JavaScript
  • License: MIT License
  • Created almost 5 years ago
  • Updated 24 days ago

Repository Details

Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends

Fast, accurate and scalable probabilistic data linkage

Splink is a Python package for probabilistic record linkage (entity resolution) that allows you to deduplicate and link records from datasets that lack unique identifiers.

Key Features

⚡ Speed: Capable of linking a million records on a laptop in around a minute.
🎯 Accuracy: Support for term frequency adjustments and user-defined fuzzy matching logic.
🌐 Scalability: Execute linkage in Python (using DuckDB) or big-data backends like AWS Athena or Spark for 100+ million records.
🎓 Unsupervised Learning: No training data is required for model training.
📊 Interactive Outputs: Multiple interactive visualisations help users understand their model and diagnose problems.
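
As a rough illustration of why term frequency adjustments can improve accuracy, the sketch below weights agreement on a rare value more heavily than agreement on a common one. It is a simplified illustration and not Splink's implementation: the helper names, the value of `m` (the assumed chance that two records for the same entity agree on the column) and the sample data are all hypothetical.

```python
import math
from collections import Counter


def term_frequencies(values):
    """Return the relative frequency of each distinct value."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def exact_match_weight(m, value, frequencies):
    """log2 evidence for a match when two records agree on `value`.

    The chance of a coincidental agreement is approximated by the
    relative frequency of the value (an assumption of this sketch).
    """
    return math.log2(m / frequencies[value])


# Hypothetical city column: one common value and one rare value
cities = ["london"] * 8 + ["wells"] * 2
frequencies = term_frequencies(cities)

common_weight = exact_match_weight(0.9, "london", frequencies)
rare_weight = exact_match_weight(0.9, "wells", frequencies)
print(common_weight, rare_weight)
```

Agreement on the rare city yields the larger weight, because two unrelated records are less likely to share it by chance.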

Splink's linkage algorithm is based on Fellegi-Sunter's model of record linkage, with various customizations to improve accuracy.
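
To make the Fellegi-Sunter idea concrete, here is a minimal sketch of how per-column evidence can be combined into a match probability. It assumes the textbook formulation, in which each comparison outcome has an `m` probability (chance of that outcome among true matches) and a `u` probability (chance among non-matches), and it ignores Splink's customizations. The function names and all numeric values are hypothetical.

```python
import math
from typing import Iterable


def match_weight(m: float, u: float) -> float:
    """log2 of the Bayes factor m / u for one comparison outcome."""
    return math.log2(m / u)


def match_probability(prior: float, weights: Iterable[float]) -> float:
    """Combine a prior match probability with per-column match weights."""
    prior_weight = math.log2(prior / (1 - prior))
    total = prior_weight + sum(weights)
    odds = 2 ** total
    return odds / (1 + odds)


# Hypothetical m and u values for one pair of records
weights = [
    match_weight(m=0.9, u=0.01),    # first_name agrees
    match_weight(m=0.95, u=0.001),  # dob agrees
    match_weight(m=0.1, u=0.9),     # city disagrees
]

# Assume 1 in 1,000 candidate pairs is a true match before seeing the data
probability = match_probability(prior=0.001, weights=weights)
print(round(probability, 3))
```

Positive weights (agreement on name and date of birth) push the probability up, while the negative weight for the disagreeing city pulls it down.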

What does Splink do?

Consider the following records that lack a unique person identifier:

[Image: tables showing what Splink does]

Splink predicts which rows link together:

[Image: tables showing what Splink does]

and clusters these links to produce an estimated person ID:

[Image: tables showing what Splink does]
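
One common way to turn pairwise links into an entity ID is to treat the links as edges of a graph and label each connected component. The sketch below does this with a small union-find structure. It illustrates the general technique only; `cluster_links` and the sample IDs are hypothetical and are not part of the Splink API.

```python
def cluster_links(record_ids, links):
    """Assign each record the smallest record ID in its connected component."""
    parent = {record_id: record_id for record_id in record_ids}

    def find(x):
        # Walk up to the root, shortening the path along the way
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the root so labels are deterministic
            if root_b < root_a:
                root_a, root_b = root_b, root_a
            parent[root_b] = root_a

    return {record_id: find(record_id) for record_id in record_ids}


# Hypothetical predicted links between five records
records = [1, 2, 3, 4, 5]
links = [(1, 2), (2, 3), (4, 5)]

clusters = cluster_links(records, links)
print(clusters)  # {1: 1, 2: 1, 3: 1, 4: 4, 5: 4}
```

Records 1, 2 and 3 share an ID even though 1 and 3 were never linked directly, which is the transitive behaviour expected of clustering.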

What data does Splink work best with?

Before using Splink, input data should be standardized, with consistent column names and formatting (e.g., lowercased, punctuation cleaned up, etc.).
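
A minimal sketch of this kind of standardization, using only the Python standard library, is shown below. The helper names and the sample record are hypothetical, and the cleaning rules (lowercasing, removing punctuation, collapsing whitespace) are one reasonable choice rather than a Splink requirement. Stripping punctuation suits name-like fields; it is probably not what you want for dates or email addresses.

```python
import string

# Translation table that deletes ASCII punctuation characters
_PUNCTUATION = str.maketrans("", "", string.punctuation)


def standardize_value(value):
    """Lowercase, strip punctuation and collapse whitespace in a string."""
    if value is None:
        return None
    cleaned = " ".join(value.lower().translate(_PUNCTUATION).split())
    # Treat strings that are empty after cleaning as missing
    return cleaned or None


def standardize_record(record):
    """Standardize column names and string values of one record."""
    result = {}
    for column, value in record.items():
        name = column.strip().lower().replace(" ", "_")
        result[name] = standardize_value(value) if isinstance(value, str) else value
    return result


raw = {"First Name": "  O'Brien, ", "City": "LONDON", "Age": 41}
clean = standardize_record(raw)
print(clean)  # {'first_name': 'obrien', 'city': 'london', 'age': 41}
```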

Splink performs best with input data containing multiple columns that are not highly correlated. For instance, if the entity type is persons, you may have columns for full name, date of birth, and city. If the entity type is companies, you could have columns for name, turnover, sector, and telephone number.

High correlation occurs when the value of a column is highly constrained (predictable) from the value of another column. For example, a 'city' field is almost perfectly correlated with 'postcode'. Gender is highly correlated with 'first name'. Correlation is particularly problematic if all of your input columns are highly correlated.
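
One rough way to check for this before linking is to measure how well one column predicts another. The sketch below scores the share of rows whose target value is the most common value for their group. The `predictability` helper and the sample rows are hypothetical, and the score is inflated when the grouping column is nearly unique, so treat it as a quick diagnostic only.

```python
from collections import Counter, defaultdict


def predictability(rows, given, target):
    """Share of rows whose `target` is the modal value for their `given` value."""
    by_given = defaultdict(Counter)
    for row in rows:
        by_given[row[given]][row[target]] += 1
    # For each group, count the rows that hold the group's most common target
    predictable = sum(counts.most_common(1)[0][1] for counts in by_given.values())
    return predictable / len(rows)


rows = [
    {"postcode": "AB1", "city": "aberdeen", "first_name": "ann"},
    {"postcode": "AB1", "city": "aberdeen", "first_name": "bob"},
    {"postcode": "LS1", "city": "leeds", "first_name": "ann"},
    {"postcode": "LS1", "city": "leeds", "first_name": "cat"},
]

city_score = predictability(rows, given="postcode", target="city")
name_score = predictability(rows, given="postcode", target="first_name")
print(city_score, name_score)  # 1.0 0.5
```

A score close to 1.0, as for city given postcode, suggests the second column adds little independent evidence.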

Splink is not designed for linking a single column containing a 'bag of words', for example a table with a single 'company name' column and no other details.

Documentation

The homepage for the Splink documentation can be found here. Interactive demos can be found here, or by clicking the following Binder link:

Binder

The specification of the Fellegi-Sunter statistical model behind Splink is similar to that used in the R fastLink package. Accompanying the fastLink package is an academic paper that describes this model. The Splink documentation site and a series of interactive articles also explore the theory behind Splink.

The Office for National Statistics have written a case study about using Splink to link 2021 Census data to itself.

Installation

Splink supports Python 3.7+. To obtain the latest released version of Splink, you can install it from PyPI using pip:

pip install splink

or, if you prefer, you can install Splink using conda:

conda install -c conda-forge splink

Should you require a more bare-bones version of Splink without DuckDB, please see the following area of the docs:

DuckDBless Splink Installation

Quickstart

The following code demonstrates how to estimate the parameters of a deduplication model, use it to identify duplicate records, and then use clustering to generate an estimated unique person ID.

For a more detailed tutorial, please see here.

from splink.duckdb.linker import DuckDBLinker
import splink.duckdb.comparison_library as cl
import splink.duckdb.comparison_template_library as ctl
import splink.duckdb.blocking_rule_library as brl
from splink.datasets import splink_datasets

df = splink_datasets.fake_1000

settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": [
        brl.exact_match_rule("first_name"),
        brl.exact_match_rule("surname"),
    ],
    "comparisons": [
        ctl.name_comparison("first_name"),
        ctl.name_comparison("surname"),
        ctl.date_comparison("dob", cast_strings_to_date=True),
        cl.exact_match("city", term_frequency_adjustments=True),
        ctl.email_comparison("email"),
    ],
}

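# Create the linker from the input data and the settings
linker = DuckDBLinker(df, settings)

# Estimate the u probabilities from randomly sampled record pairs
linker.estimate_u_using_random_sampling(max_pairs=1e6)

# Estimate the remaining parameters with expectation maximisation (EM),
# training on pairs that agree on both first_name and surname
blocking_rule_for_training = brl.and_(
    brl.exact_match_rule("first_name"),
    brl.exact_match_rule("surname"),
)
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)

# Run a second EM pass, this time training on pairs that agree on dob
blocking_rule_for_training = brl.exact_match_rule("dob")
linker.estimate_parameters_using_expectation_maximisation(blocking_rule_for_training)

# Score the candidate pairs produced by the blocking rules in the settings
pairwise_predictions = linker.predict()

# Cluster the predictions at a threshold of 0.95 and preview five rows
clusters = linker.cluster_pairwise_predictions_at_threshold(pairwise_predictions, 0.95)
clusters.as_pandas_dataframe(limit=5)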
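
The blocking rules in the settings above limit which record pairs are scored. The standalone sketch below illustrates the idea: only pairs that agree exactly on at least one blocking column become candidates. It does not use Splink (which expresses blocking in SQL on the chosen backend); `candidate_pairs` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_columns):
    """Return ID pairs that agree exactly on at least one blocking column."""
    pairs = set()
    for column in blocking_columns:
        # Group record IDs by their value in this column
        blocks = defaultdict(list)
        for record in records:
            value = record.get(column)
            if value is not None:
                blocks[value].append(record["unique_id"])
        # Every pair of records within a block is a candidate
        for ids in blocks.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


records = [
    {"unique_id": 1, "first_name": "john", "surname": "smith"},
    {"unique_id": 2, "first_name": "john", "surname": "smyth"},
    {"unique_id": 3, "first_name": "jon", "surname": "smith"},
    {"unique_id": 4, "first_name": "mary", "surname": "jones"},
]

pairs = candidate_pairs(records, ["first_name", "surname"])
print(sorted(pairs))  # [(1, 2), (1, 3)]
```

Here two candidate pairs are compared instead of all six possible pairs. The trade-off is that a true match which disagrees on every blocking column is never scored.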

Support

To find the best place to ask a question, report a bug or get general advice, please refer to our Contributing Guide.

Awards

🥇 Analysis in Government Awards 2020: Innovative Methods - Winner

🥇 MoJ DASD Awards 2020: Innovation and Impact - Winner

🥇 Analysis in Government Awards 2022: People's Choice Award - Winner

🥈 Analysis in Government Awards 2022: Innovative Methods - Runner up

Citation

If you use Splink in your research, we'd be grateful for a citation as follows:

@article{Linacre_Lindsay_Manassis_Slade_Hepworth_2022,
	title        = {Splink: Free software for probabilistic record linkage at scale.},
	author       = {Linacre, Robin and Lindsay, Sam and Manassis, Theodore and Slade, Zoe and Hepworth, Tom and Kennedy, Ross and Bond, Andrew},
	year         = 2022,
	month        = {Aug.},
	journal      = {International Journal of Population Data Science},
	volume       = 7,
	number       = 3,
	doi          = {10.23889/ijpds.v7i3.1794},
	url          = {https://ijpds.org/article/view/1794},
}

Acknowledgements

We are very grateful to ADR UK (Administrative Data Research UK) for providing the initial funding for this work as part of the Data First project.

We are extremely grateful to professors Katie Harron, James Doidge and Peter Christen for their expert advice and guidance in the development of Splink. We are also very grateful to colleagues at the UK's Office for National Statistics for their expert advice and peer review of this work. Any errors remain our own.
