• Stars
    star
    217
  • Rank 182,446 (Top 4 %)
  • Language
    Python
  • Created almost 13 years ago
  • Updated about 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A simple fuzzy matching set for python strings

fuzzyset - A fuzzy string set for python.

fuzzyset is a data structure that performs something akin to fulltext search against data to determine likely misspellings and approximate string matching.

Usage

The usage is simple. Just add a string to the set, and ask for it later by using either .get or []:

>>> a = fuzzyset.FuzzySet()
>>> a.add("michael axiak")
>>> a.get("micael asiak")
[(0.8461538461538461, u'michael axiak')]

The result will be a list of (score, mached_value) tuples. The score is between 0 and 1, with 1 being a perfect match.

For roughly 15% performance increase, there is also a Cython-implemented version called cfuzzyset. So you can write the following, akin to cStringIO and cPickle:

try:
    from cfuzzyset import cFuzzySet as FuzzySet
except ImportError:
    from fuzzyset import FuzzySet

Construction Arguments

  • iterable: An iterable that yields strings to initialize the data structure with
  • gram_size_lower: The lower bound of gram sizes to use, inclusive (see Theory of operation). Default: 2
  • gram_size_upper: The upper bound of gram sizes to use, inclusive (see Theory of operation). Default: 3
  • use_levenshtein: Whether or not to use the levenshtein distance to determine the match scoring. Default: True

Theory of operation

Adding to the data structure

First let's look at adding a string, 'michaelich' to an empty set. We first break apart the string into n-grams (strings of length n). So trigrams of 'michaelich' would look like:

'-mi'
'mic'
'ich'
'cha'
'hae'
'ael'
'eli'
'lic'
'ich'
'ch-'

Note that fuzzyset will first normalize the string by removing non word characters except for spaces and commas and force everything to be lowercase.

Next the fuzzyset essentially creates a reverse index on those grams. Maintaining a dictionary that says:

'mic' -> (1, 0)
'ich' -> (2, 0)
...

And there's a list that looks like:

[(3.31, 'michaelich')]

Note that we maintain this reverse index for all grams from gram_size_lower to gram_size_upper in the constructor. This becomes important in a second.

Retrieving

To search the data structure, we take the n-grams of the query string and perform a reverse index look up. To illustrate, let's consider looking up 'michael' in our fictitious set containing 'michaelich' where the gram_size_upper and gram_size_lower parameters are default (3 and 2 respectively).

We begin by considering first all trigrams (the value of gram_size_upper). Those grams are:

'-mi'
'mic'
'ich'
'cha'
'el-'

Then we create a list of any element in the set that has at least one occurrence of a trigram listed above. Note that this is just a dictionary lookup 5 times. For each of these matched elements, we compute the cosine similarity between each element and the query string. We then sort to get the most similar matched elements.

If use_levenshtein is false, then we return all top matched elements with the same cosine similarity.

If use_levenshtein is true, then we truncate the possible search space to 50, compute a score based on the levenshtein distance (so that we handle transpositions), and return based on that.

In the event that none of the trigrams matched, we try the whole thing again with bigrams (note though that if there are no matches, the failure to match will be quick). Bigram searching will always be slower because there will be a much larger set to order.

Install

pip install fuzzyset2

Afterwards, you can import the package simply with:

try:
    from cfuzzyset import cFuzzySet as FuzzySet
except ImportError:
    from fuzzyset import FuzzySet

License

BSD

Author

More Repositories

1

pybloomfiltermmap

Fast Python Bloom Filter using Mmap
C
741
star
2

filternet

A proxy library that provides easy hooks to manipulate http and https traffic consistently.
JavaScript
100
star
3

obs-virtual-background

C
25
star
4

java_posix_spawn

Use JNI to implement process spawning without fork()
C
16
star
5

py-rangeset

A rangeset utility for python
HTML
13
star
6

python-function-dispatch

A Haskell inspired function dispatch system for python
Python
10
star
7

remote-webkit-debug

Remote debugging for webkit (protocol v1)
Python
7
star
8

jquery-shiftclick

Easily add shift-click behavior a la gmail to your page
JavaScript
7
star
9

scala-css-parser

Simple css parser in scala using combinator parsing
Scala
7
star
10

jinja-static

static site compilation with asset management /w jinja2
Python
6
star
11

homesearchr

Homesearchr!
JavaScript
5
star
12

tep-lights

Code related to control of tep lights
C
5
star
13

xvidcap-pulseaudio

Xvidcap support for native pulseaudio and maybe alsa
C
4
star
14

speakers

Jupyter Notebook
3
star
15

secret-santa

3
star
16

axedb

Analytical eXtensiblE Database
C++
2
star
17

shihewrite

Shihewrite
Python
2
star
18

rowathon-backend-node

Rowathon node backend.
CoffeeScript
2
star
19

sc2creativity

Jupyter Notebook
2
star
20

ctrowathon

Python
1
star
21

flightsearcher

Python
1
star
22

pydistancematrix

Python
1
star
23

backup

Python
1
star
24

cssurlrewriter

C
1
star
25

wikianalysis

Fun analysis of wikipedia
C++
1
star
26

emacs_config

Emacs Lisp
1
star
27

kata

Some kata projects
1
star
28

trivia-helper

Python
1
star
29

febase

Rust
1
star
30

queuetalk-resources

Jupyter Notebook
1
star
31

mercator-rotate

fun exploration in mercator rotation
1
star
32

cttabletennis

table tennis breakdown
Python
1
star
33

scnotifier

JavaScript
1
star
34

wedding-website

JavaScript
1
star
35

rowathon-desktop-client

Rowathon desktop client for crunchtime rowathon. Designed to talk to rowers and upload data
Python
1
star
36

repanalytics

JavaScript
1
star