• Stars: 403
• Rank: 107,140 (Top 3 %)
• Language: Python
• License: MIT License
• Created over 10 years ago
• Updated 3 months ago


Repository Details

🆔 Examples for using the dedupe library

Dedupe Examples

Example scripts for dedupe, a library that uses machine learning to perform de-duplication and entity resolution quickly on structured data.

Part of the Dedupe.io cloud service and open source toolset for de-duplicating and finding fuzzy matches in your data. For more details, see the differences between Dedupe.io and the dedupe library.

To get these examples:

git clone https://github.com/dedupeio/dedupe-examples.git
cd dedupe-examples

or download this repository

cd /path/to/downloaded/file
unzip master.zip
cd dedupe-examples

Setup

We recommend using virtualenv and virtualenvwrapper for working in a virtualized development environment. Read how to set up virtualenv.

Once you have virtualenvwrapper set up,

mkvirtualenv dedupe-examples
pip install -r requirements.txt

Afterwards, whenever you want to work on dedupe-examples,

workon dedupe-examples

CSV example - early childhood locations

This example works with a list of early childhood education sites in Chicago from 10 different sources.

cd csv_example
pip install unidecode
python csv_example.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)

To see how you might use dedupe with smallish data, see the annotated source code for csv_example.py.
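As a rough illustration of the data-loading step, the sketch below reads a small CSV into a dictionary of records keyed by ID, which is the general shape dedupe works with. The column names come from the labeling example later in this README; the `Id` column, the helper names `clean` and `read_records`, and the altered casing and spacing of the sample rows are assumptions made for illustration, not code from csv_example.py.

```python
import csv
import io
import re

# Sample rows adapted from the labeling example in this README.
# The "Id" column and the mixed casing/spacing are illustrative assumptions.
SAMPLE_CSV = """Id,Site name,Address,Zip,Phone
0,Ada S. McKinley  St. Thomas CDC,3801 S. Wabash,,2850617
1,Ada S. McKinley Community Services - McKinley - St. Thomas,3801 S Wabash Ave,,2850617
"""


def clean(value):
    """Lowercase, collapse runs of whitespace, and map empty strings to None."""
    value = re.sub(r"\s+", " ", value).strip().lower()
    return value or None


def read_records(handle, id_column="Id"):
    """Return {record_id: {field: cleaned value}} from an open CSV handle."""
    records = {}
    for row in csv.DictReader(handle):
        record_id = row[id_column]
        records[record_id] = {
            field: clean(value)
            for field, value in row.items()
            if field != id_column
        }
    return records


records = read_records(io.StringIO(SAMPLE_CSV))
print(records["0"])
```

With a real file, `io.StringIO(SAMPLE_CSV)` would be replaced by `open(path, newline="")`.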

Patent example - patent holders

This example works with Dutch inventors from the PATSTAT international patent data file.

cd patent_example
pip install unidecode
python patent_example.py

(use 'y', 'n' and 'u' keys to flag duplicates for active learning, 'f' when you are finished)

Record Linkage example - electronics products

This example links two spreadsheets of electronics products by matching up their corresponding entries. Neither dataset contains duplicates on its own.

cd record_linkage_example
python record_linkage_example.py

To see how you might use dedupe for linking datasets, see the annotated source code for record_linkage_example.py.
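To make the "no duplicates within each dataset" constraint concrete, here is a toy sketch of one-to-one linking: each record on either side is used in at most one link. It deliberately swaps in `difflib.SequenceMatcher` string similarity with a fixed threshold as a stand-in; it is not dedupe's learned model and is not taken from record_linkage_example.py. The product names, IDs, and the `link` helper are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(left, right, threshold=0.6):
    """Greedily pair records across two datasets, best scores first.

    Each record is used at most once, reflecting the assumption that
    neither dataset contains duplicates of its own.
    """
    scored = sorted(
        (
            (similarity(l_name, r_name), l_id, r_id)
            for l_id, l_name in left.items()
            for r_id, r_name in right.items()
        ),
        reverse=True,
    )
    used_left, used_right, links = set(), set(), []
    for score, l_id, r_id in scored:
        if score < threshold:
            break  # remaining candidates are all weaker
        if l_id in used_left or r_id in used_right:
            continue  # one side of this pair is already linked
        links.append((l_id, r_id))
        used_left.add(l_id)
        used_right.add(r_id)
    return links


# Hypothetical product names, for illustration only.
left = {
    "a1": "Canon PowerShot SD1000 camera",
    "a2": "Sony Bravia 40 inch TV",
}
right = {
    "b1": "Sony Bravia 40 inch LCD TV",
    "b2": "Canon Powershot SD1000 digital camera",
    "b3": "Logitech wireless mouse",
}

links = link(left, right)
print(sorted(links))
```

The record `b3` stays unlinked because every record on the left is already paired by the time it is considered.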

Gazetteer example - electronics products

This example links two spreadsheets of electronics products by matching up their corresponding entries using the Gazetteer class.

cd gazetteer_example
python gazetteer_example.py

MySQL example - IL campaign contributions

See mysql_example/README.md for details

To see how you might use dedupe with biggish data, see the annotated source code for mysql_example.

PostgreSQL big dedupe example - PostgreSQL example on large dataset

See pgsql_big_dedupe_example/README.md for details

This is the same example as the MySQL IL campaign contributions dataset above, but ported to run on PostgreSQL.

Training

The secret sauce of dedupe is human input. In order to figure out the best rules to deduplicate a set of data, you must give it a set of labeled examples to learn from.

The more labeled examples you give it, the better the deduplication results will be. At minimum, you should try to provide 10 positive matches and 10 negative matches.

The results of your training will be saved in a JSON file for future runs of dedupe.
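The idea of keeping labeled pairs in a JSON file and reusing them on the next run can be sketched with the standard library alone. The layout below (a `match` list and a `distinct` list of record pairs) and the file name `training.json` are assumptions for illustration; dedupe has its own methods for reading and writing training data, and its on-disk encoding may differ by version.

```python
import json
import os
import tempfile

# One labeled pair, using the records from the labeling example in this README.
labeled = {
    "match": [
        [
            {"Phone": "2850617", "Address": "3801 s. wabash", "Zip": None,
             "Site name": "ada s. mckinley st. thomas cdc"},
            {"Phone": "2850617", "Address": "3801 s wabash ave", "Zip": None,
             "Site name": "ada s. mckinley community services - mckinley - st. thomas"},
        ]
    ],
    "distinct": [],
}


def save_labels(labels, path):
    """Write labeled pairs to a JSON file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(labels, handle, indent=2)


def load_labels(path):
    """Return saved labels, or empty label lists if no file exists yet."""
    if not os.path.exists(path):
        return {"match": [], "distinct": []}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "training.json")  # hypothetical file name
    first_run = load_labels(path)   # nothing saved yet
    save_labels(labeled, path)
    second_run = load_labels(path)  # labels from the earlier "run"

print(len(first_run["match"]), len(second_run["match"]))
```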

Here's an example labeling operation:

Phone :  2850617
Address :  3801 s. wabash
Zip :
Site name :  ada s. mckinley st. thomas cdc

Phone :  2850617
Address :  3801 s wabash ave
Zip :
Site name :  ada s. mckinley community services - mckinley - st. thomas

Do these records refer to the same thing?
(y)es / (n)o / (u)nsure / (f)inished
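The prompt above can be read as a simple loop over candidate pairs. The following is a minimal sketch of that loop, not dedupe's own labeler: the function names `label_pairs` and `has_enough` are hypothetical, and the answers are scripted so the example runs without a terminal.

```python
def label_pairs(pairs, answers):
    """Sort candidate pairs into 'match' and 'distinct' from y/n/u/f answers."""
    labels = {"match": [], "distinct": []}
    for pair, answer in zip(pairs, answers):
        if answer == "f":  # finished: stop labeling
            break
        if answer == "y":
            labels["match"].append(pair)
        elif answer == "n":
            labels["distinct"].append(pair)
        # "u" (unsure) and any other input: leave the pair unlabeled
    return labels


def has_enough(labels, minimum=10):
    """Check the suggested minimum of 10 positive and 10 negative matches."""
    return len(labels["match"]) >= minimum and len(labels["distinct"]) >= minimum


record_a = {"Phone": "2850617", "Address": "3801 s. wabash", "Zip": None,
            "Site name": "ada s. mckinley st. thomas cdc"}
record_b = {"Phone": "2850617", "Address": "3801 s wabash ave", "Zip": None,
            "Site name": "ada s. mckinley community services - mckinley - st. thomas"}

labels = label_pairs([(record_a, record_b)], ["y"])
print(len(labels["match"]), has_enough(labels))
```

In an interactive session the answers could come from `input()` instead of a list.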
