  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created almost 4 years ago
  • Updated almost 2 years ago

Repository Details

PyTorch library for transforming entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.

Entity Embed

Entity Embed allows you to transform entities like companies, products, etc. into vectors to support scalable Record Linkage / Entity Resolution using Approximate Nearest Neighbors.

Using Entity Embed, you can train a deep learning model to transform records into vectors in an N-dimensional embedding space. Thanks to a contrastive loss, those vectors are organized to keep similar records close and dissimilar records far apart in this embedding space. Embedding records enables scalable ANN search, which means finding thousands of candidate duplicate pairs of records per second per CPU.
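
To make the role of the contrastive loss concrete, here is a minimal sketch of a generic margin-based contrastive loss in plain Python. It is not Entity Embed's actual loss implementation; the function names, the `margin` parameter, and the 2-D vectors are all hypothetical and chosen only for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(u, v, is_match, margin=1.0):
    """Generic margin-based contrastive loss for one pair of embeddings.

    Matching pairs are penalized by their squared distance (pulled together).
    Non-matching pairs are penalized only while they are closer than
    `margin` (pushed apart until they clear the margin).
    """
    d = euclidean(u, v)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical 2-D embeddings for three records (illustrative values only).
acme_a = [0.0, 0.0]
acme_b = [0.3, 0.4]  # same entity as acme_a, at distance 0.5
globex = [0.6, 0.8]  # different entity, at distance 1.0 from acme_a

loss_match = contrastive_loss(acme_a, acme_b, is_match=True)   # pulls the pair together
loss_far = contrastive_loss(acme_a, globex, is_match=False)    # already at the margin
loss_near = contrastive_loss(acme_a, acme_b, is_match=False)   # would be pushed apart
```

Minimizing a loss of this shape over many labeled pairs is what keeps similar records close and dissimilar records far apart, which in turn lets a nearest-neighbor search retrieve likely duplicates.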

Entity Embed achieves a recall of ~0.99 with a pair-entity ratio below 100 on a variety of datasets. It aims for high recall at the expense of precision, so it is suited for the Blocking/Indexing stage of an Entity Resolution pipeline. A scalable and noise-tolerant Blocking procedure is often the main bottleneck for performance and quality in Entity Resolution pipelines, and this library aims to solve that. Note that the ANN search on embedded records returns several candidate pairs that must be filtered to find the best matching pairs, possibly with a pairwise classifier (an example of that is available).
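
The two blocking metrics mentioned above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's API: the helper names are hypothetical, the exhaustive cosine-similarity scan stands in for a real ANN index, the toy vectors and threshold are made up, and the pair-entity ratio is assumed to be the number of candidate pairs divided by the number of records.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def candidate_pairs(vectors, threshold):
    """Exhaustive stand-in for ANN search.

    Returns every (id, id) pair whose cosine similarity is >= threshold.
    A real ANN index avoids comparing all pairs.
    """
    return {
        (i, j)
        for (i, u), (j, v) in combinations(sorted(vectors.items()), 2)
        if cosine(u, v) >= threshold
    }


def recall(found, true_pairs):
    """Fraction of the true duplicate pairs present among the candidates."""
    return len(found & true_pairs) / len(true_pairs)


def pair_entity_ratio(found, entity_count):
    """Candidate pairs per record (assumed definition): lower means less filtering work."""
    return len(found) / entity_count


# Hypothetical embeddings for five records; (1, 2) and (3, 4) are true duplicates.
vectors = {
    1: [1.0, 0.0],
    2: [0.9, 0.1],
    3: [0.0, 1.0],
    4: [0.1, 0.9],
    5: [0.7, 0.7],
}
true_pairs = {(1, 2), (3, 4)}

found = candidate_pairs(vectors, threshold=0.75)
print(sorted(found))                            # [(1, 2), (2, 5), (3, 4), (4, 5)]
print(recall(found, true_pairs))                # 1.0
print(pair_entity_ratio(found, len(vectors)))   # 0.8
```

In this toy run both true pairs are recovered, but two extra pairs involving record 5 also appear. That is the high-recall, lower-precision trade-off described above, and the reason a later filtering step is needed.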

Entity Embed is based on, and is a special case of, the AutoBlock model described by Amazon (Zhang et al., 2020; see References).

⚠️ Warning: this project is under heavy development.

Documentation

https://entity-embed.readthedocs.io

Requirements

System

  • macOS or Linux (tested on the latest macOS and Ubuntu via GitHub Actions).
  • Entity Embed can train and run on a powerful laptop. Tested on a system with 32 GB of RAM, an RTX 2070 Mobile (8 GB VRAM), and an i7-10750H (12 threads). With batch sizes smaller than 32 and few field types, it's possible to train and run even with 2 GB of VRAM.

Libraries

See requirements.txt for the full list of required libraries.

Installation

pip install entity-embed

For Conda users

If you're using Conda, you must install PyTorch beforehand to have proper CUDA support. Inside the Conda environment, please run the following command before installing Entity Embed using pip:

conda install pytorch torchvision torchaudio cudatoolkit=11.1 -c pytorch -c conda-forge

Examples

Run:

pip install -r requirements-examples.txt

Then check the example Jupyter Notebooks:

Colab

Please check notebooks/google-colab/.

Releases

See CHANGELOG.md.

Credits

This project is maintained by open-source contributors and Vinta Software.

This package was created with Cookiecutter and the audreyr/cookiecutter-pypackage project template.

Commercial Support

Vinta Software is always looking for exciting work, so if you need any commercial support, feel free to get in touch: [email protected]

References

  • Zhang, W., Wei, H., Sisman, B., Dong, X. L., Faloutsos, C., & Page, D. (2020, January). AutoBlock: A hands-off blocking framework for entity matching. In Proceedings of the 13th International Conference on Web Search and Data Mining (pp. 744-752).
  • Dai, X., Yan, X., Zhou, K., Wang, Y., Yang, H., & Cheng, J. (2020, July). Convolutional Embedding for Edit Distance. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 599-608).

Citations

If you use Entity Embed in your research, please consider citing it.

BibTeX entry:

@software{entity-embed,
  title = {{Entity Embed}: Scalable Entity Resolution using Approximate Nearest Neighbors.},
  author = {Juvenal, Flávio and Vieira, Renato},
  url = {https://github.com/vintasoftware/entity-embed},
  version = {0.0.6},
  date = {2021-07-16},
  year = {2021}
}
