
Extract matched substrings using a pattern, similar to what package glue does in reverse


unglue

The unglue package provides functions such as unglue(), unglue_data() and unglue_unnest(), which are in many cases a more readable alternative to base regex functions. Simple cases don’t require any regex knowledge.

It uses a syntax inspired by the functions of Jim Hester’s glue package to extract matched substrings using a pattern, but it is not endorsed by the authors of glue or of the tidyverse packages.

It is completely dependency free, though the formula notation for functions is supported if rlang is installed.

Installation:

# CRAN version:
install.packages("unglue")
# Development version:
remotes::install_github("moodymudskipper/unglue")

using an example from ?glue::glue backwards

library(unglue)
library(glue)
library(magrittr)
library(utils)
glued_data <- head(mtcars) %>% glue_data("{rownames(.)} has {hp} hp")
glued_data
#> Mazda RX4 has 110 hp
#> Mazda RX4 Wag has 110 hp
#> Datsun 710 has 93 hp
#> Hornet 4 Drive has 110 hp
#> Hornet Sportabout has 175 hp
#> Valiant has 105 hp
unglue_data(glued_data, "{rownames(.)} has {hp} hp")
#>         rownames...  hp
#> 1         Mazda RX4 110
#> 2     Mazda RX4 Wag 110
#> 3        Datsun 710  93
#> 4    Hornet 4 Drive 110
#> 5 Hornet Sportabout 175
#> 6           Valiant 105

use several patterns, the first that matches will be used

facts <- c("Antarctica is the largest desert in the world!",
"The largest country in Europe is Russia!",
"The smallest country in Europe is Vatican!",
"Disneyland is the most visited place in Europe! Disneyland is in Paris!",
"The largest island in the world is Green Land!")
facts_df <- data.frame(id = 1:5, facts)

patterns <- c("The {adjective} {place_type} in {bigger_place} is {place}!",
            "{place} is the {adjective} {place_type=[^ ]+} in {bigger_place}!{=.*}")
unglue_data(facts, patterns)
#>        place    adjective place_type bigger_place
#> 1 Antarctica      largest     desert    the world
#> 2     Russia      largest    country       Europe
#> 3    Vatican     smallest    country       Europe
#> 4 Disneyland most visited      place       Europe
#> 5 Green Land      largest     island    the world

Note that the second pattern uses some regex. A regex must be typed after an = sign; if the = sign has no left-hand side, the matched expression won’t be assigned to a variable. In fact, the pattern "{foo}" is shorthand for "{foo=.*?}".
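To see why the second pattern constrains place_type with [^ ]+, the sketch below compares two hand-written base R regexes on the Disneyland sentence. It assumes that an unconstrained {label} becomes the lazy group (.*?), as stated above. The regexes and the helper capture_groups() are written for illustration only and are not taken from unglue's source.

```r
# Illustrative only: hand-written regexes mimicking the translation described above.
fact <- "Disneyland is the most visited place in Europe! Disneyland is in Paris!"

# Hypothetical helper: return the capture groups of the first match.
capture_groups <- function(regex, x) {
  regmatches(x, regexec(regex, x, perl = TRUE))[[1]][-1]
}

# Every label left at its default, i.e. lazy (.*?) groups.
all_lazy <- "^(.*?) is the (.*?) (.*?) in (.*?)!.*$"
# place_type restricted to a single word with [^ ]+.
one_word <- "^(.*?) is the (.*?) ([^ ]+) in (.*?)!.*$"

capture_groups(all_lazy, fact)
#> [1] "Disneyland"    "most"          "visited place" "Europe"
capture_groups(one_word, fact)
#> [1] "Disneyland"   "most visited" "place"        "Europe"
```

With lazy groups only, the adjective stops at the first space, so "visited" ends up in place_type. Restricting place_type to one word pushes "most visited" back into the adjective.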

escaping characters

Special characters outside of the curly braces should not be escaped.

sentences <- c("666 is [a number]", "foo is [a word]", "42 is [the answer]", "Area 51 is [unmatched]")
patterns2 <- c("{number=\\d+} is [{what}]", "{word=\\D+} is [{what}]")
unglue_data(sentences, patterns2)
#>   number       what word
#> 1    666   a number <NA>
#> 2   <NA>     a word  foo
#> 3     42 the answer <NA>
#> 4   <NA>       <NA> <NA>
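For comparison, a hand-written base R regex for the first pattern has to escape the square brackets itself. The regex below is an assumed equivalent written for illustration, not copied from unglue's output.

```r
# Base R equivalent of "{number=\\d+} is [{what}]", written by hand:
# the literal brackets must be escaped as \\[ and \\].
sentences <- c("666 is [a number]", "foo is [a word]")
manual_regex <- "^(\\d+) is \\[(.*?)\\]$"

matches <- regmatches(sentences, regexec(manual_regex, sentences, perl = TRUE))
matches[[1]]  # full match followed by the two capture groups
#> [1] "666 is [a number]" "666"               "a number"
matches[[2]]  # no match: "foo" is not made of digits
#> character(0)
```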

type conversion

To convert types automatically we can set convert = TRUE. In the example above, the column number will then be converted to numeric.

unglue_data(sentences, patterns2, convert = TRUE)
#>   number       what word
#> 1    666   a number <NA>
#> 2     NA     a word  foo
#> 3     42 the answer <NA>
#> 4     NA       <NA> <NA>

convert = TRUE triggers the use of utils::type.convert with parameter as.is = TRUE. We can also set convert to another conversion function such as readr::type_convert, or to a formula if rlang is installed.
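As a rough illustration of what that conversion does, the sketch below calls utils::type.convert() directly on character vectors shaped like the extracted columns. The vectors are typed by hand here. How unglue applies the function internally is not shown.

```r
# Columns as they come out of unglue_data() without conversion: all character.
number <- c("666", NA, "42", NA)
what   <- c("a number", "a word", "the answer", NA)

# as.is = TRUE keeps non-numeric strings as character instead of factors.
utils::type.convert(number, as.is = TRUE)
#> [1] 666  NA  42  NA
class(utils::type.convert(number, as.is = TRUE))
#> [1] "integer"
class(utils::type.convert(what, as.is = TRUE))
#> [1] "character"
```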

unglue_unnest()

unglue_unnest() is named as a tribute to tidyr::unnest(), as it’s equivalent to successively using unglue() and unnest() on a data frame column. Its syntax is similar to that of tidyr::extract(), and efforts were made to keep the two as consistent as possible.

unglue_unnest(facts_df, facts, patterns)
#>   id      place    adjective place_type bigger_place
#> 1  1 Antarctica      largest     desert    the world
#> 2  2     Russia      largest    country       Europe
#> 3  3    Vatican     smallest    country       Europe
#> 4  4 Disneyland most visited      place       Europe
#> 5  5 Green Land      largest     island    the world
unglue_unnest(facts_df, facts, patterns, remove = FALSE)
#>   id                                                                   facts
#> 1  1                          Antarctica is the largest desert in the world!
#> 2  2                                The largest country in Europe is Russia!
#> 3  3                              The smallest country in Europe is Vatican!
#> 4  4 Disneyland is the most visited place in Europe! Disneyland is in Paris!
#> 5  5                          The largest island in the world is Green Land!
#>        place    adjective place_type bigger_place
#> 1 Antarctica      largest     desert    the world
#> 2     Russia      largest    country       Europe
#> 3    Vatican     smallest    country       Europe
#> 4 Disneyland most visited      place       Europe
#> 5 Green Land      largest     island    the world

unglue_vec()

While unglue() returns a list of data frames, unglue_vec() returns a character vector (unless convert = TRUE). If several matches are found in a string, the extracted match is chosen by name or by position.

unglue_vec(sentences, patterns2, "number")
#> [1] "666" NA    "42"  NA
unglue_vec(sentences, patterns2, 1)
#> [1] "666" "foo" "42"  NA

unglue_detect()

unglue_detect() returns a logical vector. It’s convenient for checking that the input was matched by a pattern, or for subsetting the input to look at unmatched elements.

unglue_detect(sentences, patterns2)
#> [1]  TRUE  TRUE  TRUE FALSE
subset(sentences, !unglue_detect(sentences, patterns2))
#> [1] "Area 51 is [unmatched]"

unglue_regex()

unglue_regex() returns a character vector of regex patterns. All other functions are wrapped around it, and it can be used to leverage the unglue syntax in other functions.

unglue_regex(patterns)
#>            The {adjective} {place_type} in {bigger_place} is {place}! 
#>                                "^The (.*?) (.*?) in (.*?) is (.*?)!$" 
#> {place} is the {adjective} {place_type=[^ ]+} in {bigger_place}!{=.*} 
#>                            "^(.*?) is the (.*?) ([^ ]+) in (.*?)!.*$"
unglue_regex(patterns, named_capture = TRUE)
#>                                 The {adjective} {place_type} in {bigger_place} is {place}! 
#>     "^The (?<adjective>.*?) (?<place_type>.*?) in (?<bigger_place>.*?) is (?<place>.*?)!$" 
#>                      {place} is the {adjective} {place_type=[^ ]+} in {bigger_place}!{=.*} 
#> "^(?<place>.*?) is the (?<adjective>.*?) (?<place_type>[^ ]+) in (?<bigger_place>.*?)!.*$"
unglue_regex(patterns, attributes = TRUE)
#>            The {adjective} {place_type} in {bigger_place} is {place}! 
#>                                "^The (.*?) (.*?) in (.*?) is (.*?)!$" 
#> {place} is the {adjective} {place_type=[^ ]+} in {bigger_place}!{=.*} 
#>                            "^(.*?) is the (.*?) ([^ ]+) in (.*?)!.*$" 
#> attr(,"groups")
#> attr(,"groups")$`The {adjective} {place_type} in {bigger_place} is {place}!`
#>    adjective   place_type bigger_place        place 
#>            1            2            3            4 
#> 
#> attr(,"groups")$`{place} is the {adjective} {place_type=[^ ]+} in {bigger_place}!{=.*}`
#>        place    adjective   place_type bigger_place 
#>            1            2            3            4
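One possible way to reuse such a regex outside unglue is with base R's regexpr() and its capture attributes. The sketch below pastes in the named-capture regex printed above rather than calling unglue_regex(), so it runs without the package. The variable names are illustrative.

```r
# Named-capture regex copied from the unglue_regex(patterns, named_capture = TRUE) output.
rx <- "^The (?<adjective>.*?) (?<place_type>.*?) in (?<bigger_place>.*?) is (?<place>.*?)!$"
x  <- "The largest country in Europe is Russia!"

# With perl = TRUE, regexpr() records where each capture group starts and how long it is.
m       <- regexpr(rx, x, perl = TRUE)
starts  <- as.vector(attr(m, "capture.start"))
lengths <- as.vector(attr(m, "capture.length"))

# Cut the groups out of the input and label them with the capture names.
values <- substring(x, starts, starts + lengths - 1)
names(values) <- attr(m, "capture.names")

values[["place"]]
#> [1] "Russia"
values[["adjective"]]
#> [1] "largest"
```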

unglue_sub()

unglue_sub() substitutes substrings using strings or replacement functions.

unglue_sub(
  c("a and b", "foo or BAR"),
  c("{x} and {y}", "{x} or {z}"),
  list(x= "XXX", y = toupper, z = ~tolower(.)))
#> [1] "XXX and B"  "XXX or bar"

duplicated labels

We can require a match to be repeated by repeating its label; strings where the occurrences differ are not matched.

unglue_data(c("black is black","black is dark"), "{color} is {color}")
#>   color
#> 1 black
#> 2  <NA>
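In plain regex terms, this behaves like a backreference. The sketch below is a base R analogy written by hand, not necessarily how unglue implements duplicated labels.

```r
x <- c("black is black", "black is dark")

# \\1 must repeat exactly what the first group captured.
same_color <- grepl("^(.*?) is \\1$", x, perl = TRUE)
same_color
#> [1]  TRUE FALSE
```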

We can change this behavior by feeding a function to the multiple parameter; in that case the function will be applied to the matches.

unglue_data(c("System: Windows, Version: 10","System: Ubuntu, Version: 18"), 
            "System: {OS}, Version: {OS}", multiple = paste)
#>           OS
#> 1 Windows 10
#> 2  Ubuntu 18
