  • Stars: 667
  • Rank 65,037 (Top 2 %)
  • Language: Ruby
  • License: MIT License
  • Created over 12 years ago
  • Updated almost 3 years ago

Repository Details

Find a needle (a document or record) in a haystack using string similarity and (optionally) regular expression rules. Uses Dice's Coefficient (aka Pair Similarity) and Levenshtein Distance internally.

Top 3 reasons you should use FuzzyMatch

  1. intelligent defaults: it uses a combination of Pair Distance (2-gram) and Levenshtein Edit Distance to effectively match many examples with no configuration
  2. all-vs-all: it takes care of finding the optimal match by comparing everything against everything else (when that's necessary)
  3. refinable: you might get to 90% with no configuration, but if you need to go beyond that, you can use regexps, grouping, and stop words

It solves many mid-range matching problems. If your haystack is ~10k records, or if you can winnow down the initial possibilities at the database level and only bring good contenders into app memory, why not give it a shot?

FuzzyMatch

Find a needle in a haystack based on string similarity and regular expression rules.

Replaces loose_tight_dictionary because that was a confusing name.

Warning! Normalizers are gone in version 2 and above! See the CHANGELOG and check out the enhanced (and hopefully more intuitive) groupings.

[Diagram: matching process]

Quickstart

>> require 'fuzzy_match'
=> true
>> FuzzyMatch.new(['seamus', 'andy', 'ben']).find('Shamus')
=> "seamus"

See also the blog post Fuzzy match in Ruby.

Default matching (string similarity)

At the core, and even if you configure nothing else, string similarity (calculated by "pair distance" aka Dice's Coefficient) is used to compare records.

You can tell FuzzyMatch what field or method to use via the :read option. For example, let's say you want to match a Country object like #<Country name:"Uruguay" iso_3166_code:"UY">:

>> fz = FuzzyMatch.new(Country.all, :read => :name)
=> #<FuzzyMatch: [...]>
>> fz.find('youruguay')
=> #<Country name:"Uruguay" iso_3166_code:"UY">

Optional rules (regular expressions)

You can improve the default matching with rules. There are 3 different kinds of rules. Each rule is a regular expression.

We suggest that you first try without any rules and only define them to improve matching, prevent false positives, etc.

Groupings

Group records together. The two laws of groupings:

  1. If a needle matches a grouping, only compare it with straws in the same grouping (the "buddies vs buddies" rule)
  2. If a needle doesn't match any grouping, only compare it with straws that also don't match ANY grouping (the "misfits vs misfits" rule)
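
The two laws can be pictured as a filter that runs before any string similarity is calculated. The following is only a sketch in plain Ruby, not the gem's internals: grouping_for and comparable_straws are hypothetical helper names, and the sketch assumes each string belongs to the first grouping it matches.

```ruby
# Illustrative sketch only; not the gem's internals.
# `grouping_for` and `comparable_straws` are hypothetical helper names.

# Returns the first grouping regexp that matches, or nil for a "misfit".
# Assumption: a string belongs to the first grouping it matches.
def grouping_for(string, groupings)
  groupings.find { |regexp| regexp.match?(string) }
end

# Law 1: a grouped needle only meets straws from the same grouping.
# Law 2: an ungrouped needle only meets ungrouped straws (nil == nil).
def comparable_straws(needle, haystack, groupings)
  needle_grouping = grouping_for(needle, groupings)
  haystack.select { |straw| grouping_for(straw, groupings) == needle_grouping }
end

groupings = [/mandarin/i, /trump/i]
haystack = [
  'Mandarin Oriental Boston',
  'Trump Hotel Collection',
  'Luxury Collection',
  'Orient Express Hotel'
]

# Buddies vs buddies: only the Mandarin straw is a candidate.
p comparable_straws('Mandarin Oriental Miami', haystack, groupings)
# Misfits vs misfits: only straws that match no grouping are candidates.
p comparable_straws('Luxury Collection Resort', haystack, groupings)
```

Only the surviving candidates would then be scored by string similarity.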

The two laws of chained groupings: (new in v2.0 and rather important)

  1. Sub-groupings (e.g., /plaza/i below) only match if their primary (e.g., /ramada/i) does
  2. In final grouping decisions, sub-groupings win over primaries (so "Ramada Inn" is NOT grouped with "Ramada Plaza", but if you removed /plaza/i sub-grouping, then they would be grouped together)

Hopefully they are rather intuitive once you start using them.
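
As a rough illustration of the chained laws, here is a plain-Ruby sketch that decides which regexp in a chain "owns" a string. It is not the gem's implementation; effective_grouping is a hypothetical helper name, and the sketch assumes the first matching sub-grouping is the one that wins.

```ruby
# Illustrative sketch only; not the gem's internals.
# `effective_grouping` is a hypothetical helper name.
def effective_grouping(string, chain)
  primary, *subs = Array(chain)
  # Chained law 1: sub-groupings only count if the primary matches.
  return nil unless primary.match?(string)
  # Chained law 2: a matching sub-grouping wins over the primary.
  subs.find { |regexp| regexp.match?(string) } || primary
end

chain = [/ramada/i, /plaza/i]

inn    = effective_grouping('Ramada Inn', chain)   # owned by the primary
plaza  = effective_grouping('Ramada Plaza', chain) # owned by the sub-grouping
crowne = effective_grouping('Crowne Plaza', chain) # primary fails, so no grouping

puts inn == plaza   # false: "Ramada Inn" is not grouped with "Ramada Plaza"
puts crowne.nil?    # true: /plaza/i alone is not enough
```

With a chain of just [/ramada/i], both Ramada strings would fall into the same grouping, which matches the second law described above.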

[Screenshot: spreadsheet of groupings]

That will...

  • separate "Orient Express Hotel" and "Ramada Conference Center Mandarin" from real Mandarin Oriental hotels
  • keep "Trump Hotel Collection" away from "Luxury Collection" (another real hotel brand) without messing with the word "Luxury"
  • make sure that "Ramada Plaza" hotels are always grouped with other RPs, and not with plain old Ramadas, and vice versa
  • split out Hyatts into their different brands
  • and more

You specify chained groupings as arrays of regexps:

groupings = [
  /mandarin/i,
  /trump/i,
  [ /ramada/i, /plaza/i ],
  ...
]
fz = FuzzyMatch.new(haystack, groupings: groupings)

This way of specifying groupings is meant to be easy to load from a CSV, like bin/fuzzy_match does.

Formerly called "blockings," but that was jargon that confused people.

Identities

Prevent impossible matches. Identities can be very confusing, so see if you can make things work with groupings first.

Adding an identity like /(f)-?(\d50)/i ensures that "Ford F-150" and "Ford F-250" never match.

Note that identities do not establish certainty. They just say whether two records could be identical... then string similarity takes over.
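
One way to read this: an identity compares what its capture groups pull out of each record. The sketch below is plain Ruby and only an approximation of the idea. identity_verdict is a hypothetical helper name, and returning nil ("no opinion") when either record fails to match the identity is an assumption of this sketch.

```ruby
# Illustrative sketch only; not the gem's internals.
# `identity_verdict` is a hypothetical helper name.
IDENTITY = /(f)-?(\d50)/i

# Returns true  if both records match and their captures agree,
#         false if both match but the captures differ (impossible match),
#         nil   if either record does not match (assumed: no opinion).
def identity_verdict(a, b, identity)
  match_a = identity.match(a)
  match_b = identity.match(b)
  return nil unless match_a && match_b
  normalize = ->(match) { match.captures.map { |capture| capture.to_s.downcase } }
  normalize.call(match_a) == normalize.call(match_b)
end

p identity_verdict('Ford F-150', 'Ford F-250', IDENTITY) # false: never match
p identity_verdict('Ford F-150', 'ford f150', IDENTITY)  # true: could be identical
p identity_verdict('Ford F-150', 'Ford Focus', IDENTITY) # nil: identity is silent
```

A true verdict does not declare a match; it only leaves the pair eligible for string similarity scoring.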

Stop words

Ignore common and/or meaningless words when doing string similarity.

Adding a stop word like THE ensures that it is not taken into account when comparing "THE CAT", "THE DAT", and "THE CATT"

Stop words are NOT removed when checking :must_match_at_least_one_word and when doing identities and groupings.
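
To make the effect concrete, here is a small plain-Ruby sketch of stripping a stop word before scoring. It is not the gem's code: strip_stop_words is a hypothetical helper name, and expressing the stop word as the regexp /\bthe\b/i is an assumption based on the statement that each rule is a regular expression.

```ruby
# Illustrative sketch only; not the gem's internals.
# `strip_stop_words` is a hypothetical helper name.
STOP_WORDS = [/\bthe\b/i]

# Removes every stop-word match, then tidies the leftover whitespace.
def strip_stop_words(string, stop_words)
  stripped = stop_words.inject(string) { |memo, stop_word| memo.gsub(stop_word, '') }
  stripped.strip.squeeze(' ')
end

['THE CAT', 'THE DAT', 'THE CATT'].each do |record|
  puts strip_stop_words(record, STOP_WORDS)
end
# prints CAT, DAT and CATT, so only those remainders would be scored
```

Without the stop word, the shared "THE " prefix contributes identical bigrams to every record and inflates all of the similarity scores.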

Find options

  • read: how to interpret each record in the 'haystack', either a Proc or a symbol
  • must_match_grouping: don't return a match unless the needle fits into one of the groupings you specified
  • must_match_at_least_one_word: don't return a match unless the needle shares at least one word with the match. Note that "Foo's" is treated like one word (so that it won't match "'s") and "Bolivia," is treated as just "bolivia"
  • gather_last_result: enable last_result
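
The word-sharing check behind must_match_at_least_one_word can be approximated as follows. This is a plain-Ruby sketch under stated assumptions, not the gem's tokenizer: words and shares_a_word? are hypothetical helper names, and the regexp is an ASCII-only guess that keeps "Foo's" whole and drops trailing punctuation.

```ruby
# Illustrative sketch only; not the gem's internals.
# `words` and `shares_a_word?` are hypothetical helper names.

# Downcases, keeps inner apostrophes ("foo's"), drops other punctuation.
def words(string)
  string.downcase.scan(/[a-z0-9]+(?:'[a-z0-9]+)*/)
end

def shares_a_word?(needle, straw)
  (words(needle) & words(straw)).any?
end

p words("Foo's")                                   # ["foo's"]
p words('Bolivia,')                                # ["bolivia"]
p shares_a_word?("Foo's Bar", "'s")                # false
p shares_a_word?('Bolivia, Plurinational', 'bolivia') # true
```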

Case sensitivity

String similarity is case-insensitive. Everything is downcased before scoring. This is a change from previous versions.

Be careful with uppercase letters in your rules; in general, things are downcased before comparing.

String similarity algorithm

The algorithm is Dice's Coefficient (aka Pair Distance) because it seemed to work better than Longest Substring, Hamming, Jaro-Winkler, Levenshtein, etc. (although see the edge case below).

Here's a great explanation copied from the Wikipedia entry:

to calculate the similarity between:

    night
    nacht

We would find the set of bigrams in each word:

    {ni,ig,gh,ht}
    {na,ac,ch,ht}

Each set has four elements, and the intersection of these two sets has only one element: ht.

Inserting these numbers into the formula, we calculate, s = (2 · 1) / (4 + 4) = 0.25.
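
The same arithmetic can be reproduced in a few lines of plain Ruby. This is a minimal set-based sketch of the formula above, not the code the gem or amatch runs; bigrams and dice_coefficient are hypothetical helper names, and downcasing first follows the case-insensitivity described earlier.

```ruby
# Illustrative sketch only; not the gem's internals.
# `bigrams` and `dice_coefficient` are hypothetical helper names.
require 'set'

# The set of adjacent character pairs, e.g. "night" -> {ni, ig, gh, ht}.
def bigrams(string)
  string.downcase.chars.each_cons(2).map(&:join).to_set
end

# s = 2 * |X ∩ Y| / (|X| + |Y|)
def dice_coefficient(a, b)
  pairs_a = bigrams(a)
  pairs_b = bigrams(b)
  total = pairs_a.size + pairs_b.size
  return 0.0 if total.zero?
  (2.0 * (pairs_a & pairs_b).size) / total
end

puts dice_coefficient('night', 'nacht') # 0.25
```

Because this version uses sets, repeated bigrams inside one string count once; implementations that count duplicates can give slightly different scores on such inputs.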

Edge case: when Dice's fails, use Levenshtein

In edge cases where Dice's Coefficient finds two strings equally similar to a third string, Levenshtein distance is used to break the tie. For example, pair distance considers "RATZ" and "CATZ" to be equally similar to "RITZ", so we invoke Levenshtein.

>> 'RITZ'.pair_distance_similar 'RATZ'
=> 0.3333333333333333 
>> 'RITZ'.pair_distance_similar 'CATZ'
=> 0.3333333333333333                   # pair distance can't tell the difference, so we fall back to levenshtein...
>> 'RITZ'.levenshtein_similar 'RATZ'
=> 0.75 
>> 'RITZ'.levenshtein_similar 'CATZ'
=> 0.5                                  # which properly shows that RATZ should win
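
For readers without amatch installed, the tie-break can be sketched in plain Ruby. This is not the gem's implementation: levenshtein_distance and levenshtein_similarity are hypothetical helper names, and normalizing by the longer string's length is an assumption chosen because it reproduces the 0.75 and 0.5 shown above.

```ruby
# Illustrative sketch only; not the gem's internals.
# `levenshtein_distance` and `levenshtein_similarity` are hypothetical names.

# Classic dynamic-programming edit distance, keeping one row at a time.
def levenshtein_distance(a, b)
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,       # deletion
        current[j - 1] + 1,    # insertion
        previous[j - 1] + cost # substitution (or match)
      ].min
    end
    previous = current
  end
  previous.last
end

# Assumed normalization: 1 - distance / length of the longer string.
def levenshtein_similarity(a, b)
  longest = [a.length, b.length].max
  return 1.0 if longest.zero?
  1.0 - levenshtein_distance(a, b).to_f / longest
end

tied = %w[CATZ RATZ] # equal under pair distance when compared with "RITZ"
puts tied.max_by { |candidate| levenshtein_similarity('RITZ', candidate) } # RATZ
```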

Cached results

Make sure you add active_record_inline_schema to your Gemfile.

TODO write documentation. For now, please see how we manually cache matches between aircraft and flight segments.

Glossary

The admittedly imperfect metaphor is "look for a needle in a haystack"

  • needle: the search term
  • haystack: the records you are searching (your result will be an object from here)

Using amatch to make it faster

You can optionally use amatch by Florian Frank (thanks Flori!) to run the string similarity calculations in a C extension.

require 'fuzzy_match'
require 'amatch' # note that you have to require this... fuzzy_match won't require it for you
FuzzyMatch.engine = :amatch

Otherwise, pure-Ruby versions of the string similarity algorithms, derived from the answer to a StackOverflow question and the text gem, are used. Thanks marzagao and threedaymonk!

Real-world usage

We use fuzzy_match for data science at Brighter Planet and in production at

We often combine it with remote_table and errata:

  • download table with remote_table
  • correct serious or repeated errors with errata
  • fuzzy_match the rest

Contributors

Copyright

Copyright 2013 Seamus Abshere

More Repositories

1

upsert

Upsert on MySQL, PostgreSQL, and SQLite3. Transparently creates functions (UDF) for MySQL and PostgreSQL; on SQLite3, uses INSERT OR IGNORE.
Ruby
655
star
2

data_miner

Download, unpack from a ZIP/TAR/GZ/BZ2 archive, parse, correct, convert units and import Google Spreadsheets, XLS, ODS, XML, CSV, HTML, etc. into your ActiveRecord models. Uses RemoteTable gem internally.
Ruby
301
star
3

remote_table

Open local or remote XLSX, XLS, ODS, CSV (comma separated), TSV (tab separated), other delimited, fixed-width files, and Google Docs. Returns an enumerator of Arrays or Hashes, depending on whether there are headers.
HTML
226
star
4

unix_utils

Like FileUtils, but provides zip, unzip, bzip2, bunzip2, tar, untar, sed, du, md5sum, shasum, cut, head, tail, wc, unix2dos, dos2unix, iconv, curl, perl, etc.
Ruby
226
star
5

cache_method

Cache based on arguments AND object state; store in memcached, redis, or in-process. Like alias_method, but it's cache_method! One step beyond memoization.
Ruby
135
star
6

lock_and_cache

Most caching libraries don't do locking, meaning that >1 process can be calculating a cached value at the same time. Since you presumably cache things because they cost CPU, database reads, or money, doesn't it make sense to lock while caching?
Ruby
134
star
7

mysql2xxxx

Gives you binaries like mysql2csv, mysql2json, and mysql2xml, and Ruby classes to match.
Ruby
84
star
8

cache

Defines a simple interface to multiple cache-like storage engines by wrapping common Ruby client libraries like memcached, redis, memcache-client, dalli. Handles each underlying library's weirdnesses, including forking.
Ruby
69
star
9

eat

A (better?) replacement for open-uri. Lets you open local and remote files by immediately returning their contents as a string.
Ruby
32
star
10

to_regexp

Provides String#to_regexp
Ruby
27
star
11

errata

Define an errata in table format (CSV) and then apply it to an arbitrary source. Inspired by RFC Errata, lets you keep your own errata in a transparent way.
Ruby
21
star
12

cacheable

DEPRECATED. Use cache_method instead.
Ruby
20
star
13

py-upsert

Python library to make it easy to upsert on MySQL, PostgreSQL, and SQLite3.
Python
18
star
14

report

DSL for creating clean CSV, XLSX, and PDF reports in Ruby. Uses xlsx_writer, prawn and pdftk internally.
Ruby
16
star
15

database_url

Convert back and forth between Heroku-style ENV['DATABASE_URL'] and Rails/ActiveRecord-style config/database.yml hashes.
Ruby
16
star
16

lock_method

Like alias_method, but it's lock_method! (lockfiles)
Ruby
12
star
17

common_name

Helps you stop using chains of humanize/downcase/underscore/pluralize/to_sym/etc everywhere in your models, your views, your controllers, etc.
Ruby
11
star
18

engineyard-metadata

Presents a simple, unchanging interface to get metadata about your EngineYard AppCloud instances running on Amazon EC2.
Ruby
10
star
19

cohort_analysis

TBD
Ruby
10
star
20

create_table

Analyze and inspect CREATE TABLE SQL statements and translate across databases. Uses Ragel internally for parsing.
Ruby
10
star
21

ruby_ragel_examples

Examples of using ragel and ruby together
Ruby
9
star
22

fuzzy_infer

Fuzzy set analysis - predicts one or more unknown characteristics of an input case by comparing its known characteristics to a reference dataset whose records contain both the known and unknown characteristics.
Ruby
8
star
23

the_geom_geojson

For PostGIS/PostgreSQL and ActiveRecord, provides "the_geom_geojson" getter and setter that update "the_geom" and "the_geom_webmercator" columns.
Ruby
8
star
24

validates_decency_of

Rails plugin that uses George Carlin's list of seven dirty words (aka swear words, aka cuss words, aka bad words) to check for "decency" on ActiveRecord model attributes.
Ruby
6
star
25

weighted_average

Aircraft.average(:seats) versus Aircraft.weighted_average(:seats, :weighted_by => :takeoffs)
Ruby
6
star
26

loose_tight_dictionary

DEPRECATED: use fuzzy_match. Find a needle in a haystack using string similarity and (optionally) regexp rules.
Ruby
6
star
27

hash_digest

Generates non-cryptographic digests of Hashes (and Arrays) indifferent to key type (string or symbol) and ordering.
Ruby
5
star
28

redirect_routing

Ruby
5
star
29

pg_trgm

Ruby trigram similarity that is identical to Postgres's (almost)
Ruby
5
star
30

ey_cloud_awareness

DEPRECATED: use engineyard-metadata. Make your EngineYard cloud instances aware of each other.
Ruby
4
star
31

xml_split

Split XML files on an element, yielding (streaming, so constant memory usage) each node in turn. Uses sgrep2 internally; future versions should use a pure-Ruby SAX parser.
Ruby
3
star
32

characterizable

DEPRECATED. Use charisma instead.
Ruby
3
star
33

has_handle_fallback

Make it easy to use handles (callsigns/monikers/usernames) in URLs, even if they might be blank.
Ruby
3
star
34

table_warnings

Warn yourself of problems with your ActiveRecord tables.
Ruby
3
star
35

to_json_fix

TODO: one-line summary of your gem
Ruby
3
star
36

honeypot

TODO: one-line summary of your gem
Ruby
3
star
37

cohort_scope

DEPRECATED. Use cohort_analysis. Provides cohorts (in the form of ActiveRecord scopes) that dynamically widen until they contain a certain number of records.
Ruby
3
star
38

vector_embed

Vector embedding of strings, booleans, numerics, and arrays into LIBSVM / LIBLINEAR format.
Ruby
3
star
39

switches

Turn on and off parts of your code based on yaml files.
Ruby
3
star
40

flights1percent

1% flights
JavaScript
2
star
41

zip5

Convert United States zip codes to their correct Zip5 representation, even if they're missing a leading zero and/or they have the +4 suffix.
Ruby
2
star
42

zmq

Drop-in replacement for zmq gem with included binaries
Ruby
2
star
43

cvg

Like jq or grep for csv. Combine one or more CSVs while filtering on fields with regular expressions, whitelists, presence, missing, etc.
Ruby
2
star
44

json_to_csv_to_json

csv_to_json and json_to_csv
Ruby
2
star
45

has_timestamps

Rails plugin to add named timestamps to ActiveRecord models.
Ruby
2
star
46

string_enumerator

Given a string containing placeholders (like [color]), enumerate all of the possible strings resulting from filling those placeholders with replacements (like red, blue).
Ruby
2
star
47

nonrandomapp

Ruby
1
star
48

mini_record

mini_record-compat is DEPRECATED. Use original mini_record OR active_record_inline_schema instead.
Ruby
1
star
49

string_replacer

DEPRECATED/POINTLESS - use sed or augeas. Replace text in a file without disturbing the rest of the file.
Ruby
1
star
50

geocode_records

As long as you do very specific things... quickly re-geocode tables.
Ruby
1
star
51

force_schema

[DEPRECATED - use mini_record] Declare a table structure like an ActiveRecord migration and run 'force_schema!' whenever you want. For when you don't need up and down migrations.
Ruby
1
star
52

fast_timestamp

Rapidly and arbitrarily timestamp ActiveRecord records.
Ruby
1
star