widyr: Widen, process, and re-tidy a dataset

Authors: Julia Silge, David Robinson
License: MIT

This package wraps the pattern of un-tidying data into a wide matrix, performing some processing, then turning it back into a tidy form. This is useful for several mathematical operations such as co-occurrence counts, correlations, or clustering that are best done on a wide matrix.

Installation

You can install the released version of widyr from CRAN with:

install.packages("widyr")

And the development version from GitHub with:

# install.packages("devtools")
devtools::install_github("juliasilge/widyr")

Towards a precise definition of “wide” data

The term “wide data” has gone out of fashion as being “imprecise” (Wickham 2014), but I think with a proper definition the term could be entirely meaningful and useful.

A wide dataset is one or more matrices where:

  • Each row is one item
  • Each column is one feature
  • Each value is one observation
  • Each matrix is one variable

When would you want data to be wide rather than tidy? Notable examples include classification, clustering, correlation, factorization, or other operations that can take advantage of a matrix structure. In general, when you want to compare between pairs of items rather than compare between variables or between groups of observations, this is a useful structure.

The widyr package is based on the observation that during a tidy data analysis, you often want data to be wide only temporarily, before returning to a tidy structure for visualization and further analysis. widyr makes this easy through a set of pairwise_ functions.
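To make the widen, process, re-tidy pattern concrete, here is a minimal sketch of doing it by hand in base R. The toy data frame and its column names are made up for illustration, and this is not how widyr is implemented internally; it only shows the three steps that the pairwise_ functions wrap.

```r
# Toy tidy data: one row per item (country) per feature (year).
# All values are invented for illustration.
tidy <- data.frame(
  country = rep(c("A", "B", "C"), each = 3),
  year    = rep(c(2000, 2005, 2010), times = 3),
  value   = c(1, 2, 3,  1, 2, 5,  4, 6, 8)
)

# 1. Widen: one row per item, one column per feature.
wide <- tapply(tidy$value, list(tidy$country, tidy$year), FUN = sum)

# 2. Process: an operation that wants a matrix, here Euclidean distance.
d <- as.matrix(dist(wide))

# 3. Re-tidy: back to one row per pair of items, dropping self-pairs.
retidied <- as.data.frame(as.table(d), stringsAsFactors = FALSE)
names(retidied) <- c("item1", "item2", "distance")
retidied <- retidied[retidied$item1 != retidied$item2, ]
retidied
```

The result has one row per ordered pair of items, which is the same shape that the pairwise_ functions return.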

Example: gapminder

Consider the gapminder dataset in the gapminder package.

library(dplyr)
library(gapminder)

gapminder
#> # A tibble: 1,704 × 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # … with 1,694 more rows
#> # ℹ Use `print(n = ...)` to see more rows

This tidy format (one row per country per year) is very useful for grouping, summarizing, and filtering operations. But if we want to compare countries (for example, to find countries that are similar to each other), we would have to reshape this dataset. Note that here, each country is an item, while each year is a feature.

Pairwise operations

The widyr package offers pairwise_ functions that operate on pairs of items within data. An example is pairwise_dist:

library(widyr)

gapminder %>%
  pairwise_dist(country, year, lifeExp)
#> # A tibble: 20,022 × 3
#>    item1      item2       distance
#>    <fct>      <fct>          <dbl>
#>  1 Albania    Afghanistan   107.  
#>  2 Algeria    Afghanistan    76.8 
#>  3 Angola     Afghanistan     4.65
#>  4 Argentina  Afghanistan   110.  
#>  5 Australia  Afghanistan   129.  
#>  6 Austria    Afghanistan   124.  
#>  7 Bahrain    Afghanistan    98.1 
#>  8 Bangladesh Afghanistan    45.3 
#>  9 Belgium    Afghanistan   125.  
#> 10 Benin      Afghanistan    39.3 
#> # … with 20,012 more rows
#> # ℹ Use `print(n = ...)` to see more rows

This finds the Euclidean distance between the lifeExp values of each pair of countries. It uses year, the feature column, to decide which values to compare between countries.
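As a rough illustration of what that distance means for a single pair, the sketch below aligns two series on the feature column and then takes the square root of the summed squared differences. The two "countries" and their values are hypothetical, and the code uses base R only.

```r
# Hypothetical life expectancies for two made-up countries.
# The rows of `b` are deliberately in a different order.
a <- data.frame(year = c(1952, 1957, 1962), lifeExp = c(50, 52, 54))
b <- data.frame(year = c(1962, 1952, 1957), lifeExp = c(54, 53, 56))

# Align the two series on the feature column (year),
# then compare them value by value.
both <- merge(a, b, by = "year", suffixes = c("_a", "_b"))
euclid <- sqrt(sum((both$lifeExp_a - both$lifeExp_b)^2))
euclid
```

Matching on year before subtracting is the point: the row order of the tidy input does not matter, only which feature each value belongs to.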

We could find the closest pairs of countries overall with arrange():

gapminder %>%
  pairwise_dist(country, year, lifeExp) %>%
  arrange(distance)
#> # A tibble: 20,022 × 3
#>    item1          item2          distance
#>    <fct>          <fct>             <dbl>
#>  1 Germany        Belgium            1.08
#>  2 Belgium        Germany            1.08
#>  3 United Kingdom New Zealand        1.51
#>  4 New Zealand    United Kingdom     1.51
#>  5 Norway         Netherlands        1.56
#>  6 Netherlands    Norway             1.56
#>  7 Italy          Israel             1.66
#>  8 Israel         Italy              1.66
#>  9 Finland        Austria            1.94
#> 10 Austria        Finland            1.94
#> # … with 20,012 more rows
#> # ℹ Use `print(n = ...)` to see more rows

Notice that this includes duplicates (Germany/Belgium and Belgium/Germany). To avoid those (the upper triangle of the distance matrix), use upper = FALSE:

gapminder %>%
  pairwise_dist(country, year, lifeExp, upper = FALSE) %>%
  arrange(distance)
#> # A tibble: 10,011 × 3
#>    item1       item2          distance
#>    <fct>       <fct>             <dbl>
#>  1 Belgium     Germany            1.08
#>  2 New Zealand United Kingdom     1.51
#>  3 Netherlands Norway             1.56
#>  4 Israel      Italy              1.66
#>  5 Austria     Finland            1.94
#>  6 Belgium     United Kingdom     1.95
#>  7 Iceland     Sweden             2.01
#>  8 Comoros     Mauritania         2.01
#>  9 Belgium     United States      2.09
#> 10 Germany     Ireland            2.10
#> # … with 10,001 more rows
#> # ℹ Use `print(n = ...)` to see more rows
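The row counts above follow from simple pair counting: with 142 distinct countries there are 142 × 141 ordered pairs but only half as many unordered ones. The sketch below checks that arithmetic and mimics the de-duplication on a two-row toy table; the filter shown is an assumption about how to reproduce the effect in base R, not a description of widyr's internals.

```r
n_items <- 142                                # distinct countries in gapminder
ordered_pairs   <- n_items * (n_items - 1)    # both directions of every pair
unordered_pairs <- choose(n_items, 2)         # each pair counted once

# Toy table holding both orderings of one pair (distance taken from the output above).
pairs <- data.frame(
  item1    = c("Germany", "Belgium"),
  item2    = c("Belgium", "Germany"),
  distance = c(1.08, 1.08),
  stringsAsFactors = FALSE
)

# Keep only the ordering where item1 sorts before item2.
kept <- pairs[pairs$item1 < pairs$item2, ]
kept
```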

In some analyses, we may be interested in correlation rather than distance of pairs. For this we would use pairwise_cor:

gapminder %>%
  pairwise_cor(country, year, lifeExp, upper = FALSE)
#> # A tibble: 10,011 × 3
#>    item1       item2     correlation
#>    <fct>       <fct>           <dbl>
#>  1 Afghanistan Albania         0.966
#>  2 Afghanistan Algeria         0.987
#>  3 Albania     Algeria         0.953
#>  4 Afghanistan Angola          0.986
#>  5 Albania     Angola          0.976
#>  6 Algeria     Angola          0.952
#>  7 Afghanistan Argentina       0.971
#>  8 Albania     Argentina       0.949
#>  9 Algeria     Argentina       0.991
#> 10 Angola      Argentina       0.936
#> # … with 10,001 more rows
#> # ℹ Use `print(n = ...)` to see more rows
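For comparison, a pairwise correlation can be sketched by hand with base R's cor() on a small wide matrix. The matrix below is invented for illustration (items in rows, features in columns), and the approach is only an approximation of what pairwise_cor does for you.

```r
# Toy wide matrix: one row per item, one column per feature (year).
wide <- rbind(
  A = c(1, 2, 3),
  B = c(1, 2, 5),
  C = c(4, 6, 8)
)
colnames(wide) <- c("2000", "2005", "2010")

# cor() correlates columns, so transpose to put items in columns.
cor_mat <- cor(t(wide))

# Re-tidy and keep each unordered pair once.
tidy_cor <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
names(tidy_cor) <- c("item1", "item2", "correlation")
tidy_cor <- tidy_cor[tidy_cor$item1 < tidy_cor$item2, ]
tidy_cor
```

Items A and C move in exact proportion across the three features, so their correlation is 1 even though their Euclidean distance is large; this is the kind of difference that motivates choosing correlation over distance.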

Code of Conduct

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
