• Stars
    star
    232
  • Rank 172,847 (Top 4 %)
  • Language
    Java
  • License
    Apache License 2.0
  • Created about 6 years ago
  • Updated 7 months ago


Repository Details

A Java library to determine probability of objects being similar.

Fuzzy-Matcher

Introduction

A Java-based library to match and group "similar" elements in a collection of documents.

Imagine working in a system with a collection of contacts and wanting to match and categorize contacts with similar names, addresses or other attributes. The Fuzzy Match matching algorithm can help you do this. The Fuzzy Match algorithm can even help you find duplicate contacts, or prevent your system from adding duplicates.

This library can act on any domain object, like contact, and find similarity for various use cases. It dives deep into each character and finds out the probability that 2 or more objects are similar.

What's Fuzzy

The contacts "Steven Wilson" living at "45th Avenue 5th st." and "Stephen Wilkson" living at "45th Ave 5th Street" might look like they belong to the same person. It's easy for humans to ignore small variances in the spelling of names, or abbreviations used in addresses. But for a computer program they are not the same. The string Steven does not equal Stephen, and neither does Street equal st. If our trusted computers start looking at each character and the sequence in which the characters appear, the strings might look similar. Fuzzy matching algorithms are all about providing this level of magnification to our myopic machines.

How does this work

Breaking down your data

This algorithm accepts data as a list of entities called Document (like a contact entity in your system), each of which can contain 1 or more Element (like names, addresses, emails, etc.). Internally each element is further broken down into 1 or more Token, and the tokens are then matched using a configurable MatchType.

This combination of tokenizing the data and then matching the tokens can extract similarity from a wide variety of data types.

Exact word match

Consider these Elements defined in two different Documents

  • Wayne Grace Jr.
  • Grace Hilton Wayne

With a simple tokenization process, each word here can be considered a token, and elements that share words are scored on the number of matching tokens. In this example the words Wayne and Grace match, which is 2 words out of the 3 in each element. A scoring mechanism will match them with a result of 0.67.
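The arithmetic above can be reproduced with a small standalone sketch. It does not use the library; the class name `WordMatchSketch` and the rule "shared tokens divided by the larger token count" are assumptions chosen to mirror the 2-out-of-3 example, and the library's own scoring may differ in detail.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class WordMatchSketch {

    // Trim, lower-case, drop everything except letters, digits and spaces,
    // then split on whitespace so that each word becomes one token.
    static Set<String> wordTokens(String value) {
        String cleaned = value.trim().toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9 ]", "");
        return new HashSet<>(Arrays.asList(cleaned.split("\\s+")));
    }

    // Assumed scoring rule: shared tokens divided by the larger token count.
    static double score(String left, String right) {
        Set<String> a = wordTokens(left);
        Set<String> b = wordTokens(right);
        Set<String> common = new HashSet<>(a);
        common.retainAll(b);
        return (double) common.size() / Math.max(a.size(), b.size());
    }

    public static void main(String[] args) {
        // Two of the three tokens are shared, so the score is 2/3.
        System.out.println(score("Wayne Grace Jr.", "Grace Hilton Wayne"));
    }
}
```

For the two elements above, `score` returns 2/3, which rounds to the 0.67 quoted in the text.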

Soundex word match

Consider these Elements in two different Documents

  • Steven Wilson
  • Stephen Wilkson

Here we do not just look at each word, but encode it using Soundex, which gives a code for the phonetic spelling of the name. So in this example the words Steven and Stephen both encode to S315, whereas the words Wilson and Wilkson both encode to W425.

This allows both elements to match exactly and score 1.0.
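To make the encoding concrete, here is a hand-rolled sketch of American Soundex. It is only an illustration: the library is described as relying on the Apache Soundex implementation, and the class name `SoundexSketch` is hypothetical.

```java
import java.util.Locale;

class SoundexSketch {

    // Soundex digit for each letter A..Z; '0' marks letters that are not coded.
    private static final String CODES = "01230120022455012623010202";

    static String encode(String word) {
        String letters = word.toUpperCase(Locale.ROOT).replaceAll("[^A-Z]", "");
        if (letters.isEmpty()) {
            return "";
        }
        StringBuilder out = new StringBuilder().append(letters.charAt(0));
        char last = CODES.charAt(letters.charAt(0) - 'A');
        for (int i = 1; i < letters.length() && out.length() < 4; i++) {
            char c = letters.charAt(i);
            if (c == 'H' || c == 'W') {
                continue; // H and W do not separate letters that share a code
            }
            char code = CODES.charAt(c - 'A');
            if (code != '0' && code != last) {
                out.append(code);
            }
            last = code; // a vowel resets 'last', so a repeated code is kept after it
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Steven", "Stephen", "Wilson", "Wilkson"}) {
            System.out.println(name + " -> " + encode(name));
        }
    }
}
```

Because both names in each pair produce the same code, an equality match on the encoded tokens treats the two elements as identical.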

NGram token match

In cases where breaking down the Elements into words is not feasible, we split them using NGrams. Take emails, for example.

Here, if we ignore the domain name and take 3-character sequences (tri-grams) of the data, the tokens will look like this

  • parker.james -> [par, ark, rke, ker, er., r.j, .ja, jam, ame, mes]
  • james_parker -> [jam, ame, mes, es_, s_p, _pa, par, ark, rke, ker]

Comparing these NGrams, 7 of the 10 tokens match exactly, which gives a score of 0.7.
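The tri-gram comparison can be sketched without the library as follows. The class name `TriGramSketch` and the rule of dividing shared tokens by the larger token count are assumptions made to reproduce the 7-out-of-10 example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class TriGramSketch {

    // Slide a 3-character window over the value and collect each distinct window.
    static Set<String> triGrams(String value) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 3 <= value.length(); i++) {
            grams.add(value.substring(i, i + 3));
        }
        return grams;
    }

    // Assumed scoring rule: shared tri-grams divided by the larger token count.
    static double score(String left, String right) {
        Set<String> a = triGrams(left);
        Set<String> b = triGrams(right);
        Set<String> common = new LinkedHashSet<>(a);
        common.retainAll(b);
        int total = Math.max(a.size(), b.size());
        return total == 0 ? 0.0 : (double) common.size() / total;
    }

    public static void main(String[] args) {
        System.out.println(triGrams("parker.james"));
        System.out.println(triGrams("james_parker"));
        System.out.println(score("parker.james", "james_parker"));
    }
}
```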

Nearest Neighbors match

In certain cases breaking down elements into tokens and comparing tokens is not an option. Examples are numeric values, like dollar amounts in a list of transactions.

  • 100.54
  • 200.00
  • 100.00

Here the first and third could belong to the same transaction, where the third is only missing some precision. The match is done not on tokens being equal but on the closeness (the neighborhood range) in which the values appear. This closeness is again configurable. With a closeness of 99%, the first and third values match with a score of 1.0.

A similar example can be thought of with Dates, where dates that are near to each other might point to the same event.
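One way to picture the neighborhood range is as a ratio test between two values. The sketch below is an assumption about how closeness could be defined (the smaller value must be at least `range` times the larger one); the source does not spell out the library's exact formula, and the class name `NearestNeighborSketch` is hypothetical.

```java
class NearestNeighborSketch {

    // Assumed closeness rule for positive values: the smaller value must be
    // at least `range` (0.0 - 1.0) times the larger value.
    static boolean isNeighbor(double a, double b, double range) {
        double smaller = Math.min(a, b);
        double larger = Math.max(a, b);
        return smaller >= larger * range;
    }

    public static void main(String[] args) {
        double range = 0.99;
        System.out.println(isNeighbor(100.54, 100.00, range)); // within 1% of each other
        System.out.println(isNeighbor(100.54, 200.00, range)); // far apart
    }
}
```

Under this rule 100.54 and 100.00 are neighbors at a range of 0.99, while 200.00 is a neighbor of neither.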

Four Stages of Fuzzy Match

[Image: Fuzzy Match]

We spoke in detail about Token and MatchType, which are the core of fuzzy matching, and touched upon Scoring, which gives the measure of how similar the matched data is. PreProcessing your data is a simple yet powerful mechanism that helps you start with clean data before running a match. These 4 stages, which are highly customizable, can be used to tune and match a wide variety of data types.

  • Pre-Processing : This accepts a Java Function, which allows you to develop the pre-processing functionality externally and pass it to the library, or to use one of the existing functions. These are a few examples that are already available

    • Trim: Removes leading and trailing spaces (applied by default)
    • Lower Case: Converts all characters to lowercase (applied by default)
    • Remove Special Chars : Removes all characters except alpha and numeric characters and spaces. (default for TEXT type)
    • Numeric: Strips all non-numeric characters. Useful for numeric values like phone or ssn (default for NUMBER type)
    • Email: Strips away the domain from an email. This prevents common domains like gmail.com and yahoo.com from being considered in the match (default for EMAIL type)
  • Tokenization : This again accepts a Function so can be externally defined and fed to the library. Some commonly used are already available.

    • Word : Breaks down an element into words (anything delimited by space " ").
    • N-Gram : Breaks down an element into 3 letter grams.
    • Word-Soundex : Breaks down an element into words (space delimited) and encodes each word with Soundex, using the Apache Soundex library
    • Value : Nothing to break down here, just uses the element value as token. Useful for Nearest Neighbor matches
  • Match Type : Allows 2 types of matches, which can be applied to each Element

    • Equality: Uses exact matches with token values.
    • Nearest Neighbor: Finds tokens that are contained in the neighborhood range, that can be specified as a probability (0.0 - 1.0) for each element. It defaults to 0.9
  • Scoring : These are defined for Element and Document matches

    • Element scoring: Uses a simple average, where for each element the number of matching tokens is divided by the total number of tokens. A configurable threshold can be set for each element, beyond which elements are considered to match (default set at 0.3)
    • Document scoring: A similar approach, where the number of matching elements is compared with the total number of elements. In addition, each element can be given a weight. This is useful when some elements in a document are considered more significant than others. A threshold can also be specified at the document level (defaults to 0.5), beyond which documents are considered to match
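Since pre-processing is described as a plain Java Function, the built-in options listed above can be pictured as small functions chained with `andThen`. The following sketch is illustrative only: the constant names are hypothetical and are not the library's own function names.

```java
import java.util.Locale;
import java.util.function.Function;

class PreProcessSketch {

    // Hypothetical stand-ins for the pre-processing options listed above.
    static final Function<String, String> TRIM = String::trim;
    static final Function<String, String> LOWER_CASE = s -> s.toLowerCase(Locale.ROOT);
    static final Function<String, String> REMOVE_SPECIAL_CHARS =
            s -> s.replaceAll("[^A-Za-z0-9 ]", "");
    static final Function<String, String> NUMERIC = s -> s.replaceAll("[^0-9]", "");
    static final Function<String, String> REMOVE_DOMAIN =
            s -> s.contains("@") ? s.substring(0, s.indexOf('@')) : s;

    // A composed pipeline comparable to the defaults described for TEXT elements.
    static final Function<String, String> TEXT =
            TRIM.andThen(LOWER_CASE).andThen(REMOVE_SPECIAL_CHARS);

    public static void main(String[] args) {
        System.out.println(TEXT.apply("  Wayne Grace Jr. "));
        System.out.println(NUMERIC.apply("(123) 456-7890"));
        System.out.println(REMOVE_DOMAIN.apply("parker.james@gmail.com"));
    }
}
```

Any function with the same shape, taking a value and returning the cleaned value, could be written outside the library and supplied in the same way.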

End User Configuration

All the configurable options defined above can be applied at various points in the library.

Predefined Element Types

Below is the list of predefined Element Types available with sensible defaults. These can be overridden by setters while creating an Element.

Element Type | PreProcessing Function | Tokenizer Function           | Match Type
NAME         | namePreprocessing()    | wordSoundexEncodeTokenizer() | EQUALITY
TEXT         | removeSpecialChars()   | wordTokenizer()              | EQUALITY
ADDRESS      | addressPreprocessing() | wordSoundexEncodeTokenizer() | EQUALITY
EMAIL        | removeDomain()         | triGramTokenizer()           | EQUALITY
PHONE        | numericValue()         | decaGramTokenizer()          | EQUALITY
NUMBER       | numberPreprocessing()  | valueTokenizer()             | NEAREST_NEIGHBORS
DATE         | none()                 | valueTokenizer()             | NEAREST_NEIGHBORS
AGE          | numberPreprocessing()  | valueTokenizer()             | NEAREST_NEIGHBORS

Note: Since each element is unique in the way it should match, if you need to match an element type that is not yet supported, please open a new GitHub Issue and the community will provide support and enhancements to this library.

Document Configuration

  • Key: Required field indicating unique primary key of the document
  • Elements: Set of elements for each document
  • Threshold: A double value between 0.0 - 1.0, above which the document is considered as match.

Element Configuration

  • Value : String representation of the value to match
  • Type : These are predefined elements, which apply relevant functions for "PreProcessing", "Tokenization" and "MatchType"
  • Variance: (Optional) Differentiates elements of the same type within a document, e.g. a document containing 2 NAME elements, one for "user" and one for "spouse"
  • Threshold: A double value between 0.0 - 1.0, above which the element is considered as match.
  • Weight: A value applied to an element to increase or decrease the document score. The default is 1.0, any value above that will increase the document score if that element is matched.
  • PreProcessingFunction: Overrides the PreProcessingFunction defined by Type
  • TokenizerFunction: Overrides the TokenizerFunction defined by Type
  • MatchType: Overrides the MatchType defined by Type
  • NeighborhoodRange: Relevant only for the NEAREST_NEIGHBORS MatchType. Defines how close the Value should be to be considered a match. Accepted values are between 0.0 - 1.0 (defaults to 0.9)
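To show how Threshold and Weight could interact, here is a simplified sketch of a weighted document score. It assumes the document score is the weight of the matched elements divided by the total weight. That is an illustrative rule and not necessarily the library's formula. The class and field names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

class DocumentScoreSketch {

    // Hypothetical holder for one element comparison: its score, threshold and weight.
    static final class ElementResult {
        final double score;
        final double threshold;
        final double weight;

        ElementResult(double score, double threshold, double weight) {
            this.score = score;
            this.threshold = threshold;
            this.weight = weight;
        }

        boolean matched() {
            return score > threshold;
        }
    }

    // Assumed rule: weight of matched elements divided by the total weight.
    static double documentScore(List<ElementResult> elements) {
        double total = 0;
        double matched = 0;
        for (ElementResult e : elements) {
            total += e.weight;
            if (e.matched()) {
                matched += e.weight;
            }
        }
        return total == 0 ? 0 : matched / total;
    }

    public static void main(String[] args) {
        List<ElementResult> elements = Arrays.asList(
                new ElementResult(1.0, 0.3, 2.0),  // name: matched, double weight
                new ElementResult(0.2, 0.3, 1.0)); // address: below its threshold
        double score = documentScore(elements);
        System.out.println(score + " matches: " + (score > 0.5));
    }
}
```

With these made-up numbers the heavier name element alone lifts the document above the default document threshold of 0.5, which is the effect the Weight setting is meant to have.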

Match Service

It supports 3 ways to match the documents

  • Match a list of Documents: This is useful if you have an existing list of documents, and want to find out which of them might have potential duplicates. A typical de-dup use case
matchService.applyMatchByDocId(List<Document> documents)
  • Match a list of Documents with an Existing List: This is useful for matching a new list of documents with an existing list in your system. For example, if you're performing a bulk import and want to find out if any of them match with existing data
matchService.applyMatchByDocId(List<Document> documents, List<Document> matchWith)
  • Match a Document with Existing List: This is useful when a new document is being created and you want to ensure that a similar document does not already exist in your system
matchService.applyMatchByDocId(Document document, List<Document> matchWith)

Match Results

The response of the library is essentially a Match<Document> object. It has 3 attributes

  • Data: This is the source Document on which the match is applied
  • MatchedWith: This is the target Document that the data matched with
  • Result: This is the probability score between 0.0 - 1.0 indicating how similar the 2 documents are

The response is grouped by Data.key, so every MatchService method returns a map

Map<String, List<Match<Document>>>
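The shape of this map can be pictured with a standalone sketch that groups flat match rows by the key of their source document. `MatchRow` is a hypothetical stand-in for Match<Document>, and the keys and scores are made-up sample values.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

class MatchGroupingSketch {

    // Hypothetical stand-in for Match<Document>: source key, target key and score.
    static final class MatchRow {
        final String dataKey;
        final String matchedWithKey;
        final double result;

        MatchRow(String dataKey, String matchedWithKey, double result) {
            this.dataKey = dataKey;
            this.matchedWithKey = matchedWithKey;
            this.result = result;
        }
    }

    // Group every match under the key of the document it was computed for.
    static Map<String, List<MatchRow>> groupByDataKey(List<MatchRow> rows) {
        return rows.stream().collect(Collectors.groupingBy(row -> row.dataKey));
    }

    // Made-up sample: documents 1 and 3 match each other.
    static List<MatchRow> sample() {
        return Arrays.asList(new MatchRow("1", "3", 1.0), new MatchRow("3", "1", 1.0));
    }

    public static void main(String[] args) {
        groupByDataKey(sample()).forEach((key, matches) ->
                matches.forEach(m -> System.out.println(
                        key + " matched with " + m.matchedWithKey + " score " + m.result)));
    }
}
```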

Quick Start

Maven Import

The library is published to Maven Central

<dependency>
    <groupId>com.intuit.fuzzymatcher</groupId>
    <artifactId>fuzzy-matcher</artifactId>
    <version>1.2.0</version>
</dependency>

(Note: This requires Java 11. For Java 8 use version 1.1.x)

Input

This library takes a collection of Document objects with various Elements as input.

For example, if you have multiple contacts as a simple String array

String[][] input = {
        {"1", "Steven Wilson", "45th Avenue 5th st."},
        {"2", "John Doe", "546 freeman ave"},
        {"3", "Stephen Wilkson", "45th Ave 5th Street"}
};

Convert them into List of Document

// NAME and ADDRESS are ElementType constants (static import assumed)
List<Document> documentList = Arrays.stream(input)
        .map(contact -> new Document.Builder(contact[0])
                .addElement(new Element.Builder<String>().setValue(contact[1]).setType(NAME).createElement())
                .addElement(new Element.Builder<String>().setValue(contact[2]).setType(ADDRESS).createElement())
                .createDocument())
        .collect(Collectors.toList());

Applying the Match

The entry point for running this program is the MatchService class. Create a new instance of MatchService and use its applyMatchByDocId methods to find matches

MatchService matchService = new MatchService();
Map<String, List<Match<Document>>> result = matchService.applyMatchByDocId(documentList);

Output

This prints the result to console. This should show a match between the 1st and 3rd document, but not the 2nd.

result.entrySet().forEach(entry -> {
    entry.getValue().forEach(match -> {
        System.out.println("Data: " + match.getData() + " Matched With: " + match.getMatchedWith() + " Score: " + match.getScore().getResult());
    });
});

Performance

Most real-life data sets are neither as small nor as simple as the ones shown in the examples.
Since this library can be used to match elements against a large set of records, knowing how it performs is essential.

The performance characteristics vary primarily with the MatchType being used

  • EQUALITY - For equality match, which is the default for most Element Types, the performance is linear, O(N), where N is the number of Elements across all documents.

  • NEAREST_NEIGHBORS - This is the default for numeric and date Element Types, and the performance is O(N log N). It also depends on the NeighborhoodRange setting: the higher the value, the better it performs. It is advisable not to use 1.0 as a NeighborhoodRange. Instead, override the MatchType to be EQUALITY, which guarantees linear performance.

The following chart shows the performance characteristics of this library as the number of elements increases. As you can see, the library maintains near-linear performance and can match thousands of elements within seconds on a multi-core processor.
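A rough intuition for these two complexity classes comes from the data structures that each kind of lookup suggests. The sketch below is not taken from the library's internals: it assumes a hash index for equality matches and a sorted map for range queries, and all names in it are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class IndexSketch {

    // Equality: one hash lookup per token, so N tokens cost O(N) overall.
    // Each row is {documentKey, token}.
    static Map<String, List<String>> groupByToken(String[][] keyAndToken) {
        Map<String, List<String>> index = new HashMap<>();
        for (String[] row : keyAndToken) {
            index.computeIfAbsent(row[1], token -> new ArrayList<>()).add(row[0]);
        }
        return index;
    }

    // Nearest neighbors: a sorted map locates the start of each range in
    // O(log N), so N lookups cost O(N log N) plus the size of the results.
    static Collection<String> neighbors(TreeMap<Double, String> sorted, double value, double range) {
        return sorted.subMap(value * range, true, value / range, true).values();
    }

    // Sample amounts keyed by value, with the document key as the map value.
    static TreeMap<Double, String> sampleAmounts() {
        TreeMap<Double, String> sorted = new TreeMap<>();
        sorted.put(100.54, "1");
        sorted.put(200.00, "2");
        sorted.put(100.00, "3");
        return sorted;
    }

    public static void main(String[] args) {
        String[][] tokens = {{"1", "S315"}, {"2", "J500"}, {"3", "S315"}};
        System.out.println(groupByToken(tokens).get("S315"));
        System.out.println(neighbors(sampleAmounts(), 100.00, 0.99));
    }
}
```

A wider neighborhood (a lower range value) makes each `subMap` window return more candidates, which is consistent with the advice that higher NeighborhoodRange values perform better.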

[Chart: Perf]
