• Stars
    star
    37
  • Rank 716,386 (Top 15 %)
  • Language
    R
  • License
    GNU General Publi...
  • Created almost 6 years ago
  • Updated over 5 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Automated Detection of Regular Expression Patterns

regexmagic

The goal of regexmagic is to provide an automated method for classifying a vector of strings into groupings based on regex matches. This differs from finding matches to a known regex within a vector, rather this helps determine commonalities between strings.

Installation

You can install the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("jonocarroll/regexmagic")

Example

Given the vector of strings in the (provided) example data, this package will determine the groupings by regex

data(identifiers)
print(identifiers)
#>  [1] "XY-27121"     "AB.312.Z0_0"  "XX-00000"     "XY-20687"    
#>  [5] "50006955595R" "50000000000X" "XY-92612"     "50095973410R"
#>  [9] "50066227417R" "XY-86755"     "50018372252R" "AB.122.Z0_0" 
#> [13] "AB.935.Z0_1"  "XY-70476"     "50050222847R" "XY-74486"    
#> [17] "50015512791R" "XY-92436"     "50071469441R" "XY-67174"    
#> [21] "XY-47337"     "50095731925R" "50063296214R" "XY-21637"    
#> [25] "AB.010.Z0_1"  "AB.243.Z0_1"  "AB.363.Z0_1"  "XY-48420"    
#> [29] "AB.464.Z0_0"  "AB.424.Z0_0"  "AB.952.Z0_0"  "AB.654.Z0_0" 
#> [33] "XY-47937"     "AB.483.Z0_0"  "AB.391.Z0_1"  "AB.604.Z0_0" 
#> [37] "AX.000.Z0_0"  "50074522550R" "XY-89660"     "AB.898.Z0_1" 
#> [41] "50084037368R" "XY-03564"     "50079836993R" "AB.610.Z0_0" 
#> [45] "AB.214.Z0_1"  "AB.872.Z0_0"  "AB.497.Z0_1"  "AB.532.Z0_1" 
#> [49] "XY-30383"     "XY-24708"     "AB.213.Z0_1"  "XY-45418"    
#> [53] "AB.039.Z0_1"  "XY-88379"     "AB.634.Z0_1"  "AB.013.Z0_0" 
#> [57] "XY-38334"     "50018653451R" "AB.041.Z0_0"  "50021858177R"
#> [61] "XY-23592"     "AB.359.Z0_0"  "AB.058.Z0_0"  "50083386769R"
#> [65] "AB.710.Z0_1"

Within this example data there are 3 distinct patterns, along with 3 identifers which do not match exactly these patterns (confounders). The goal is to produce a package which can detect the common patterns and sort the identifiers into the correct groups.

Methodology

First, common substrings are identified. These are allowed some tolerance by which the identifiers may deviate. By default this is 95% of samples should match the pattern.

With the example data, we can detect the common substrings

purrr::map(split_by_length(identifiers), find_common_substrings)
#> $`8`
#> [1] "XY-#####"
#> 
#> $`11`
#> [1] "AB.###.Z0_#"
#> 
#> $`12`
#> [1] "500#########"

This has become confused by the confounder in each group which destroys the perfect relationship. We can improve this by lowering the tolerance

(guess <- purrr::map(split_by_length(identifiers), find_common_substrings, tolerance = 0.9))
#> $`8`
#> [1] "XY-#####"
#> 
#> $`11`
#> [1] "AB.###.Z0_#"
#> 
#> $`12`
#> [1] "500########R"

This successfully identifies the common substrings, but leaves the pattern determining the missing parts unknown. Next we need to determine common patterns for these. We can search some given patterns to see if each character matches this enough times. The pre-defined patterns are

known_patterns
#> [1] "[0-9]"       "[A-Z]"       "[[:punct:]]"

Appying these, one character at a time, and seeing which pattern matches the most number of characters at a position, we can determine which pattern best fits at that position

(guess <- purrr::map(split_by_length(identifiers), detect_pattern, tolerance = 0.9))
#> $`8`
#> [1] "XY-[0-9][0-9][0-9][0-9][0-9]"
#> 
#> $`11`
#> [1] "AB\\.[0-9][0-9][0-9]\\.Z0_[0-9]"
#> 
#> $`12`
#> [1] "500[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]R"

How many identifiers match these patterns?

matches <- purrr::map2(split_by_length(identifiers), guess, ~stringr::str_match(.x, .y))
names(matches) <- guess
matches
#> $`XY-[0-9][0-9][0-9][0-9][0-9]`
#>       [,1]      
#>  [1,] "XY-27121"
#>  [2,] NA        
#>  [3,] "XY-20687"
#>  [4,] "XY-92612"
#>  [5,] "XY-86755"
#>  [6,] "XY-70476"
#>  [7,] "XY-74486"
#>  [8,] "XY-92436"
#>  [9,] "XY-67174"
#> [10,] "XY-47337"
#> [11,] "XY-21637"
#> [12,] "XY-48420"
#> [13,] "XY-47937"
#> [14,] "XY-89660"
#> [15,] "XY-03564"
#> [16,] "XY-30383"
#> [17,] "XY-24708"
#> [18,] "XY-45418"
#> [19,] "XY-88379"
#> [20,] "XY-38334"
#> [21,] "XY-23592"
#> 
#> $`AB\\.[0-9][0-9][0-9]\\.Z0_[0-9]`
#>       [,1]         
#>  [1,] "AB.312.Z0_0"
#>  [2,] "AB.122.Z0_0"
#>  [3,] "AB.935.Z0_1"
#>  [4,] "AB.010.Z0_1"
#>  [5,] "AB.243.Z0_1"
#>  [6,] "AB.363.Z0_1"
#>  [7,] "AB.464.Z0_0"
#>  [8,] "AB.424.Z0_0"
#>  [9,] "AB.952.Z0_0"
#> [10,] "AB.654.Z0_0"
#> [11,] "AB.483.Z0_0"
#> [12,] "AB.391.Z0_1"
#> [13,] "AB.604.Z0_0"
#> [14,] NA           
#> [15,] "AB.898.Z0_1"
#> [16,] "AB.610.Z0_0"
#> [17,] "AB.214.Z0_1"
#> [18,] "AB.872.Z0_0"
#> [19,] "AB.497.Z0_1"
#> [20,] "AB.532.Z0_1"
#> [21,] "AB.213.Z0_1"
#> [22,] "AB.039.Z0_1"
#> [23,] "AB.634.Z0_1"
#> [24,] "AB.013.Z0_0"
#> [25,] "AB.041.Z0_0"
#> [26,] "AB.359.Z0_0"
#> [27,] "AB.058.Z0_0"
#> [28,] "AB.710.Z0_1"
#> 
#> $`500[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]R`
#>       [,1]          
#>  [1,] "50006955595R"
#>  [2,] NA            
#>  [3,] "50095973410R"
#>  [4,] "50066227417R"
#>  [5,] "50018372252R"
#>  [6,] "50050222847R"
#>  [7,] "50015512791R"
#>  [8,] "50071469441R"
#>  [9,] "50095731925R"
#> [10,] "50063296214R"
#> [11,] "50074522550R"
#> [12,] "50084037368R"
#> [13,] "50079836993R"
#> [14,] "50018653451R"
#> [15,] "50021858177R"
#> [16,] "50083386769R"

Working Prototype

Now that the pieces seem to work, we can apply the categorisations in a function, returning (invisibly) a list of matches and non-matches, and printing a summary to the screen

results <- categorise_regex(identifiers, tolerance = 0.9)
#>    ** CATEGORISATION SUMMARY **
#>    ** Detected 3 categories and matched
#>     62 / 65 ( 0.954% ) strings **
#>   nchar: 8
#> example: XY-27121
#>   regex: XY-[0-9][0-9][0-9][0-9][0-9]
#>   match: 20 / 21 ( 95.2% )
#>   nchar: 11
#> example: AB.312.Z0_0
#>   regex: AB\.[0-9][0-9][0-9]\.Z0_[0-9]
#>   match: 27 / 28 ( 96.4% )
#>   nchar: 12
#> example: 50006955595R
#>   regex: 500[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]R
#>   match: 15 / 16 ( 93.8% )

where we see the single confounder in each case is not matched.

The actual categorisations are also available

results
#> $`8`
#> $`8`$regex
#> [1] "XY-[0-9][0-9][0-9][0-9][0-9]"
#> 
#> $`8`$matches
#>  [1] "XY-27121" "XY-20687" "XY-92612" "XY-86755" "XY-70476" "XY-74486"
#>  [7] "XY-92436" "XY-67174" "XY-47337" "XY-21637" "XY-48420" "XY-47937"
#> [13] "XY-89660" "XY-03564" "XY-30383" "XY-24708" "XY-45418" "XY-88379"
#> [19] "XY-38334" "XY-23592"
#> 
#> $`8`$nonmatches
#> [1] "XX-00000"
#> 
#> 
#> $`11`
#> $`11`$regex
#> [1] "AB\\.[0-9][0-9][0-9]\\.Z0_[0-9]"
#> 
#> $`11`$matches
#>  [1] "AB.312.Z0_0" "AB.122.Z0_0" "AB.935.Z0_1" "AB.010.Z0_1" "AB.243.Z0_1"
#>  [6] "AB.363.Z0_1" "AB.464.Z0_0" "AB.424.Z0_0" "AB.952.Z0_0" "AB.654.Z0_0"
#> [11] "AB.483.Z0_0" "AB.391.Z0_1" "AB.604.Z0_0" "AB.898.Z0_1" "AB.610.Z0_0"
#> [16] "AB.214.Z0_1" "AB.872.Z0_0" "AB.497.Z0_1" "AB.532.Z0_1" "AB.213.Z0_1"
#> [21] "AB.039.Z0_1" "AB.634.Z0_1" "AB.013.Z0_0" "AB.041.Z0_0" "AB.359.Z0_0"
#> [26] "AB.058.Z0_0" "AB.710.Z0_1"
#> 
#> $`11`$nonmatches
#> [1] "AX.000.Z0_0"
#> 
#> 
#> $`12`
#> $`12`$regex
#> [1] "500[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]R"
#> 
#> $`12`$matches
#>  [1] "50006955595R" "50095973410R" "50066227417R" "50018372252R"
#>  [5] "50050222847R" "50015512791R" "50071469441R" "50095731925R"
#>  [9] "50063296214R" "50074522550R" "50084037368R" "50079836993R"
#> [13] "50018653451R" "50021858177R" "50083386769R"
#> 
#> $`12`$nonmatches
#> [1] "50000000000X"

Yet To Do

  • reduce ‘runs’ of patterns, e.g. [0-9][0-9] to [0-9]{2}
  • find shortest regex which matches, e.g. [AB] vs [A-Z]
  • variable-length identifiers
  • multiple identifiers with a given length
  • most testing
  • documentation

More Repositories

1

ggeasy

ggplot2 shortcuts (transformations made easy)
HTML
267
star
2

mathpix

Query the mathpix API to convert math images to LaTeX
R
244
star
3

ggshape

Arrange 'ggplot' facets in arbitrary shapes
R
79
star
4

ntfy

Lightweight Wrapper to the ntfy.sh Service
JavaScript
60
star
5

ggghost

👻 Capture the spirit of your ggplot call
R
49
star
6

purrr2for

Automagically Convert purrr R Calls to Efficient Julia for Loops
R
49
star
7

githubtools

Tools to complement building and using R packages installed from GitHub
R
24
star
8

dash

RStudio Addin to Run a Selection as a Background Job
R
22
star
9

DFplyr

A `DataFrame` (`S4Vectors`) backend for `dplyr`
R
20
star
10

importAs

Idiomatic Python Shorthand Imports
R
20
star
11

charcuterie

Handle Strings as Vectors of Characters
R
19
star
12

realtime

R
15
star
13

TriangulArt.jl

Artify images using Delaunay Triangulation
Julia
15
star
14

butteRfly

Build a social network dashboard in R (Twitter/Facebook/GitHub/etc...)
R
14
star
15

starryeyes

"Oh my God! — it's full of stars!"
R
12
star
16

tidyGDPR

General Data Protection Regulation as Tidy Object(s)
R
11
star
17

VolunteerVignettes

The world needs more #rstats vignettes, so I'll be the change I want to see
R
9
star
18

runkeepR

Extract, plot, and analyse Runkeeper(TM) data.
R
9
star
19

opr

🔐 Interact with the 1Password CLI tool 'op'
R
7
star
20

btts

We need to go... Back To The Source
R
6
star
21

22degrees

Attempts to investigate moon halo star counts vs rain events in R
R
5
star
22

jcarroll.com.au

HTML
5
star
23

rx86

Simulated x86 assembly processing
R
4
star
24

chaint

"chain with a tee" -- add tee functions to magrittr/dplyr chains
R
4
star
25

advent-of-code

My Advent of Code Solutions in R and Rust
R
3
star
26

bluey

Episode scripts for the TV show Bluey scraped from the fandom site
Python
3
star
27

AUelection2016

Analysis of the 2016 Australian Federal Election with flexdashboard
HTML
3
star
28

itdepends

Is It Sufficient To Use Imports?
R
2
star
29

rps.rs

Rock, Paper, Scissors game, demonstrating enums
Rust
1
star
30

RDataGovAU

Access data.gov.au open data sets with R
R
1
star
31

jonocarroll

Jonathan Carroll
1
star
32

SAHansard

Wraps the SA Parliament Hansard API
CSS
1
star
33

FEN.rs

[toy project] FEN chess notation parser in Rust
Rust
1
star
34

useR2022

CSS
1
star
35

chessclub

R
1
star
36

weasel

Provides `pop()` and `push()` functionality
R
1
star