String distance calculation the tidy way.

tidystringdist

Compute string distance the tidy way. Built on top of the ‘stringdist’ package.

Install tidystringdist

You can install the development version from GitHub with:

devtools::install_github("ColinFay/tidystringdist")

The stable version can be installed with:

install.packages("tidystringdist")

tidystringdist basic workflow

tidy_comb and tidy_comb_all

First, you need to create a tibble with the combinations of words you want to compare. You can do this with the tidy_comb and tidy_comb_all functions. The first takes a base word and combines it with each element of a list or a column of a data.frame; the second builds all the possible pairs from a list or a column.

If you already have a data.frame with two columns containing the strings to compare, you can skip this part.

library(tidystringdist)

tidy_comb_all(LETTERS[1:3])
#> # A tibble: 3 x 2
#>      V1    V2
#> * <chr> <chr>
#> 1     A     B
#> 2     A     C
#> 3     B     C
tidy_comb_all(iris, Species)
#> # A tibble: 3 x 2
#>           V1         V2
#> *      <chr>      <chr>
#> 1     setosa versicolor
#> 2     setosa  virginica
#> 3 versicolor  virginica
tidy_comb("Paris", state.name[1:3])
#> # A tibble: 3 x 2
#>        V1    V2
#> *   <chr> <chr>
#> 1 Alabama Paris
#> 2  Alaska Paris
#> 3 Arizona Paris
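
For intuition, the pairing that tidy_comb_all performs on a character vector resembles what base R's combn() produces. The following is a minimal sketch using only base R; it is not the package's implementation, and the object names (words, pairs, pairs_df) are chosen here for illustration.

```r
# Base-R sketch of the pairing step; not the package's implementation.
words <- LETTERS[1:3]

# combn() returns a matrix with one unordered pair per column,
# so transpose it to get one pair per row.
pairs <- t(combn(words, 2))

# Name the columns V1 and V2 to mirror the tibbles shown above.
pairs_df <- data.frame(
  V1 = pairs[, 1],
  V2 = pairs[, 2],
  stringsAsFactors = FALSE
)
print(pairs_df)
```

Unlike the package functions, this sketch returns a plain data.frame rather than a tibble.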

tidy_stringdist

Once you’ve got this data.frame, you can use tidy_stringdist to compute string distances. This function takes a data.frame, the two columns containing the strings, and one or more stringdist methods.

Note that if you’ve used the tidy_comb or tidy_comb_all function to create your data.frame, you won’t need to set the column names.

library(dplyr)
data(starwars)
tidy_comb_sw <- tidy_comb_all(starwars, name)
tidy_stringdist(tidy_comb_sw)
#> Warning in do_dist(a = b, b = a, method = method, weight = weight, maxDist
#> = maxDist, : Non-printable ascii or non-ascii characters in soundex.
#> Results may be unreliable. See ?printable_ascii.
#> # A tibble: 3,741 x 12
#>                V1                 V2   osa    lv    dl hamming   lcs qgram
#>  *          <chr>              <chr> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
#>  1 Luke Skywalker              C-3PO    14    14    14     Inf    19    19
#>  2 Luke Skywalker              R2-D2    14    14    14     Inf    19    19
#>  3 Luke Skywalker        Darth Vader    11    11    11     Inf    17    17
#>  4 Luke Skywalker        Leia Organa    11    11    11     Inf    17    15
#>  5 Luke Skywalker          Owen Lars    12    12    12     Inf    15    11
#>  6 Luke Skywalker Beru Whitesun lars    16    16    16     Inf    22    18
#>  7 Luke Skywalker              R5-D4    14    14    14     Inf    19    19
#>  8 Luke Skywalker  Biggs Darklighter    13    13    13     Inf    21    19
#>  9 Luke Skywalker     Obi-Wan Kenobi    14    14    14      14    24    22
#> 10 Luke Skywalker   Anakin Skywalker     5     5     5     Inf     8     8
#> # ... with 3,731 more rows, and 4 more variables: cosine <dbl>,
#> #   jaccard <dbl>, jw <dbl>, soundex <dbl>

By default, the call computes all the methods. You can select specific methods with the method argument:

tidy_stringdist(tidy_comb_sw, method = c("osa","jw"))
#> # A tibble: 3,741 x 4
#>                V1                 V2   osa        jw
#>  *          <chr>              <chr> <dbl>     <dbl>
#>  1 Luke Skywalker              C-3PO    14 1.0000000
#>  2 Luke Skywalker              R2-D2    14 1.0000000
#>  3 Luke Skywalker        Darth Vader    11 0.5752165
#>  4 Luke Skywalker        Leia Organa    11 0.5335498
#>  5 Luke Skywalker          Owen Lars    12 0.4624339
#>  6 Luke Skywalker Beru Whitesun lars    16 0.4656085
#>  7 Luke Skywalker              R5-D4    14 1.0000000
#>  8 Luke Skywalker  Biggs Darklighter    13 0.5728291
#>  9 Luke Skywalker     Obi-Wan Kenobi    14 0.6349206
#> 10 Luke Skywalker   Anakin Skywalker     5 0.2816558
#> # ... with 3,731 more rows
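
To make the idea of one distance per row concrete, here is a small sketch that uses only base R. It assumes that utils::adist() with its default unit costs is an acceptable stand-in for a Levenshtein distance such as the lv column; the words are illustrative and this is not how the package computes its results.

```r
# Base-R sketch: one Levenshtein distance per row of a two-column data frame.
# The words below are illustrative and not taken from the package.
pairs_df <- data.frame(
  V1 = c("kitten", "flaw", "tidy"),
  V2 = c("sitting", "lawn", "tidy"),
  stringsAsFactors = FALSE
)

# utils::adist() returns a matrix, so take the single cell for each pair.
pairs_df$lv <- mapply(
  function(a, b) as.numeric(utils::adist(a, b)[1, 1]),
  pairs_df$V1, pairs_df$V2,
  USE.NAMES = FALSE
)
print(pairs_df)
```

The result keeps the input columns and adds the distance as a new column, which is the same shape as the tibbles shown above.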

Tidyverse workflow

The goal is to provide a convenient interface to work with other tools from the tidyverse.

tidy_stringdist(tidy_comb_sw, method = "osa") %>%
  filter(osa > 20) %>%
  arrange(desc(osa))
#> # A tibble: 11 x 3
#>                       V1                    V2   osa
#>                    <chr>                 <chr> <dbl>
#>  1                 C-3PO Jabba Desilijic Tiure    21
#>  2                 C-3PO Wicket Systri Warrick    21
#>  3                 R2-D2 Wicket Systri Warrick    21
#>  4                 R5-D4 Wicket Systri Warrick    21
#>  5 Jabba Desilijic Tiure                 IG-88    21
#>  6 Jabba Desilijic Tiure                 Cordé    21
#>  7 Jabba Desilijic Tiure                R4-P17    21
#>  8 Jabba Desilijic Tiure                   BB8    21
#>  9                 IG-88 Wicket Systri Warrick    21
#> 10 Wicket Systri Warrick                R4-P17    21
#> 11 Wicket Systri Warrick                   BB8    21
starwars %>%
  filter(species == "Droid") %>%
  tidy_comb_all(name) %>%
  tidy_stringdist() %>% 
  summarise_if(is.numeric, mean)
#> # A tibble: 1 x 10
#>     osa    lv    dl hamming   lcs qgram    cosine   jaccard        jw
#>   <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>     <dbl>     <dbl>     <dbl>
#> 1   4.4   4.4   4.4     Inf   7.4   7.4 0.8304896 0.8671032 0.6422222
#> # ... with 1 more variables: soundex <dbl>

Contact

Questions and feedback are welcome!
