  • Stars: 195
  • Rank 199,374 (Top 4 %)
  • Language: Ruby
  • License: MIT License
  • Created about 10 years ago
  • Updated 3 months ago


Repository Details

Ruby & C implementation of the Jaro-Winkler distance algorithm with support for UTF-8 strings.


jaro_winkler is an implementation of the Jaro-Winkler distance algorithm. It is written as a C extension and falls back to a pure Ruby version on platforms other than MRI/KRI, such as JRuby or Rubinius. Both the C and Ruby implementations support any string encoding, such as UTF-8, EUC-JP, Big5, etc.

Installation

gem install jaro_winkler

Usage

require 'jaro_winkler'

# Jaro Winkler Distance

JaroWinkler.distance "MARTHA", "MARHTA"
# => 0.9611
JaroWinkler.distance "MARTHA", "marhta", ignore_case: true
# => 0.9611
JaroWinkler.distance "MARTHA", "MARHTA", weight: 0.2
# => 0.9778

# Jaro Distance

JaroWinkler.jaro_distance "MARTHA", "MARHTA"
# => 0.9444444444444445

There is no JaroWinkler.jaro_winkler_distance; the name is tediously long.

Options

| Name | Type | Default | Note |
| --- | --- | --- | --- |
| ignore_case | boolean | false | All lower case characters are converted to upper case prior to the comparison. |
| weight | number | 0.1 | A constant scaling factor for how much the score is adjusted upwards for having common prefixes. |
| threshold | number | 0.7 | The prefix bonus is only added when the compared strings have a Jaro distance above the threshold. |
| adj_table | boolean | false | Gives partial credit for characters that may be errors due to known phonetic or character recognition errors. A typical example is to match the letter "O" with the number "0". |
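
To make the roles of weight and threshold concrete, here is a minimal sketch of the prefix bonus step in pure Ruby. It is not the gem's code: the helper name winkler_adjust and the prefix cap MAX_PREFIX = 4 are assumptions taken from the conventional Jaro-Winkler definition.

```ruby
# Conventional cap on the common prefix length (an assumption, not read from the gem).
MAX_PREFIX = 4

# Hypothetical helper: applies the prefix bonus to an already computed Jaro distance.
def winkler_adjust(jaro, s1, s2, weight: 0.1, threshold: 0.7)
  # The bonus is only added when the Jaro distance is above the threshold.
  return jaro unless jaro > threshold

  # Count leading characters that are equal in both strings, up to MAX_PREFIX.
  prefix = s1.chars.zip(s2.chars).take(MAX_PREFIX).take_while { |a, b| a == b }.size

  # Scale the remaining gap to 1.0 by prefix length and weight.
  jaro + prefix * weight * (1 - jaro)
end

jaro = 0.9444444444444445 # Jaro distance of "MARTHA" and "MARHTA" from the Usage section
puts winkler_adjust(jaro, "MARTHA", "MARHTA").round(4)
puts winkler_adjust(jaro, "MARTHA", "MARHTA", weight: 0.2).round(4)
```

With the shared prefix "MAR", the two calls should reproduce the 0.9611 and 0.9778 results shown in the Usage section.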

Adjusting Table

Default Table

['A', 'E'], ['A', 'I'], ['A', 'O'], ['A', 'U'], ['B', 'V'], ['E', 'I'], ['E', 'O'], ['E', 'U'], ['I', 'O'], ['I', 'U'],
['O', 'U'], ['I', 'Y'], ['E', 'Y'], ['C', 'G'], ['E', 'F'], ['W', 'U'], ['W', 'V'], ['X', 'K'], ['S', 'Z'], ['X', 'S'],
['Q', 'C'], ['U', 'V'], ['M', 'N'], ['L', 'I'], ['Q', 'O'], ['P', 'R'], ['I', 'J'], ['2', 'Z'], ['5', 'S'], ['8', 'B'],
['1', 'I'], ['1', 'L'], ['0', 'O'], ['0', 'Q'], ['C', 'K'], ['G', 'J'], ['E', ' '], ['Y', ' '], ['S', ' ']
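
A table of pairs like this can be turned into a lookup structure. The sketch below uses only a few of the pairs, and it assumes that similarity is symmetric (that "O" matches "0" and "0" matches "O"); the names SAMPLE_PAIRS, ADJ_LOOKUP and similar? are hypothetical and not part of the gem's API.

```ruby
# A handful of pairs copied from the default table; the full list is longer.
SAMPLE_PAIRS = [%w[A E], %w[B V], %w[5 S], %w[0 O], %w[1 I], %w[1 L]].freeze

# Build a two-level hash so each pair can be looked up in either order
# (symmetry is an assumption of this sketch).
ADJ_LOOKUP = Hash.new { |hash, key| hash[key] = {} }
SAMPLE_PAIRS.each do |a, b|
  ADJ_LOOKUP[a][b] = true
  ADJ_LOOKUP[b][a] = true
end

# Hypothetical helper: true when two characters are listed as similar.
def similar?(c1, c2)
  # fetch avoids triggering the default block for unknown characters.
  ADJ_LOOKUP.fetch(c1, {}).key?(c2)
end

puts similar?("O", "0")
puts similar?("0", "O")
puts similar?("A", "B")
```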

How does it work?

Original Formula:

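$$
d_j = \begin{cases} 0 & \text{if } m = 0 \\ \frac{1}{3}\left(\frac{m}{|s_1|} + \frac{m}{|s_2|} + \frac{m - t}{m}\right) & \text{otherwise} \end{cases}
$$

where

  • |s_1| and |s_2| are the lengths of the two strings.
  • m is the number of matching characters.
  • t is half the number of transpositions.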
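
The original formula can be sketched in pure Ruby as follows. This is an illustrative version, not the gem's implementation: the method name jaro_distance_sketch is hypothetical, and the matching window (half the longer length, minus one) and the rounding down of t are assumptions taken from the conventional definition of Jaro distance.

```ruby
# Illustrative Jaro distance; works on characters, so multibyte strings are handled.
def jaro_distance_sketch(s1, s2)
  a = s1.chars
  b = s2.chars
  return 0.0 if a.empty? || b.empty?

  # Two equal characters match only if their positions differ by at most `window`.
  window = [[a.size, b.size].max / 2 - 1, 0].max
  a_matched = Array.new(a.size, false)
  b_matched = Array.new(b.size, false)

  a.each_with_index do |ch, i|
    lo = [i - window, 0].max
    hi = [i + window, b.size - 1].min
    (lo..hi).each do |j|
      next if b_matched[j] || b[j] != ch
      a_matched[i] = b_matched[j] = true
      break
    end
  end

  m = a_matched.count(true).to_f
  return 0.0 if m.zero?

  # Read the matched characters of each string in order; every position where
  # they differ counts as half a transposition (rounded down).
  a_seq = a.each_index.select { |i| a_matched[i] }.map { |i| a[i] }
  b_seq = b.each_index.select { |j| b_matched[j] }.map { |j| b[j] }
  t = a_seq.zip(b_seq).count { |x, y| x != y } / 2

  (m / a.size + m / b.size + (m - t) / m) / 3
end

puts jaro_distance_sketch("MARTHA", "MARHTA")
```

For "MARTHA" and "MARHTA" all six characters match and one pair is transposed, so m = 6 and t = 1, which gives (1 + 1 + 5/6) / 3 and agrees with the jaro_distance result shown in the Usage section.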

With Adjusting Table:

adj

where

  • s is the number of nonmatching but similar characters.

Why This?

There is another similar gem named fuzzy-string-match, which also provides both C and Ruby versions.

I reinvented this wheel because the naming in fuzzy-string-match, such as getDistance, breaks Ruby convention, and it contains some odd code like a1 = s1.split( // ) (s1.chars would be better). Furthermore, it has bugs (see the tables below).

Compare with other gems

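| | jaro_winkler | fuzzystringmatch | hotwater | amatch |
| --- | --- | --- | --- | --- |
| Encoding Support | Yes | Pure Ruby only | No | No |
| Windows Support | Yes | ? | No | Yes |
| Adjusting Table | Yes | No | No | No |
| Native | Yes | Yes | Yes | Yes |
| Pure Ruby | Yes | Yes | No | No |
| Speed | 1st | 3rd | 2nd | 4th |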

The table below compares the accuracy of each gem:

| str_1 | str_2 | origin | jaro_winkler | fuzzystringmatch | hotwater | amatch |
| --- | --- | --- | --- | --- | --- | --- |
| "henka" | "henkan" | 0.9667 | 0.9667 | 0.9722 | 0.9667 | 0.9444 |
| "al" | "al" | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| "martha" | "marhta" | 0.9611 | 0.9611 | 0.9611 | 0.9611 | 0.9444 |
| "jones" | "johnson" | 0.8324 | 0.8324 | 0.8324 | 0.8324 | 0.7905 |
| "abcvwxyz" | "cabvwxyz" | 0.9583 | 0.9583 | 0.9583 | 0.9583 | 0.9583 |
| "dwayne" | "duane" | 0.84 | 0.84 | 0.84 | 0.84 | 0.8222 |
| "dixon" | "dicksonx" | 0.8133 | 0.8133 | 0.8133 | 0.8133 | 0.7667 |
| "fvie" | "ten" | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |

Benchmark

$ bundle exec rake benchmark
ruby 2.4.1p111 (2017-03-22 revision 58053) [x86_64-darwin16]

# C Extension
Rehearsal --------------------------------------------------------------
jaro_winkler (8c16e09)       0.240000   0.000000   0.240000 (  0.241347)
fuzzy-string-match (1.0.1)   0.400000   0.010000   0.410000 (  0.403673)
hotwater (0.1.2)             0.250000   0.000000   0.250000 (  0.254503)
amatch (0.4.0)               0.870000   0.000000   0.870000 (  0.875930)
----------------------------------------------------- total: 1.770000sec

                                 user     system      total        real
jaro_winkler (8c16e09)       0.230000   0.000000   0.230000 (  0.236921)
fuzzy-string-match (1.0.1)   0.380000   0.000000   0.380000 (  0.381942)
hotwater (0.1.2)             0.250000   0.000000   0.250000 (  0.254977)
amatch (0.4.0)               0.860000   0.000000   0.860000 (  0.861207)

# Pure Ruby
Rehearsal --------------------------------------------------------------
jaro_winkler (8c16e09)       0.440000   0.000000   0.440000 (  0.438470)
fuzzy-string-match (1.0.1)   0.860000   0.000000   0.860000 (  0.862850)
----------------------------------------------------- total: 1.300000sec

                                 user     system      total        real
jaro_winkler (8c16e09)       0.440000   0.000000   0.440000 (  0.439237)
fuzzy-string-match (1.0.1)   0.910000   0.010000   0.920000 (  0.920259)
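
The "Rehearsal" layout above resembles what Ruby's standard Benchmark.bmbm prints. The sketch below shows how a comparison of that shape could be set up; it is not the project's rake task. The workload common_prefix_length, the constants BENCH_PAIRS and ITERATIONS, and the report label are hypothetical stand-ins, to be swapped for calls such as JaroWinkler.distance when the gems are installed.

```ruby
require 'benchmark'

# Hypothetical stand-in workload; replace with the method under test.
def common_prefix_length(s1, s2)
  s1.chars.zip(s2.chars).take_while { |a, b| a == b }.size
end

BENCH_PAIRS = [%w[henka henkan], %w[martha marhta], %w[jones johnson]].freeze
ITERATIONS = 10_000

# bmbm runs a rehearsal pass first, then the measured pass, and returns
# one Benchmark::Tms object per report.
results = Benchmark.bmbm do |x|
  x.report('common_prefix_length') do
    ITERATIONS.times do
      BENCH_PAIRS.each { |a, b| common_prefix_length(a, b) }
    end
  end
end

puts results.first.real
```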

Todo

  • Custom adjusting word table.
