• Stars
    star
    583
  • Rank 73,617 (Top 2 %)
  • Language
    R
  • License
    Other
  • Created almost 8 years ago
  • Updated 4 months ago

Repository Details

Explore correlations in R

corrr

corrr is a package for exploring correlations in R. It focuses on creating and working with data frames of correlations (instead of matrices) that can be easily explored via corrr functions or by leveraging tools like those in the tidyverse. This, along with the primary corrr functions, is represented below:

You can install:

  • the latest released version from CRAN with
install.packages("corrr")
  • the latest development version from GitHub with
# install.packages("remotes") 
remotes::install_github("tidymodels/corrr")

Using corrr

Using corrr typically starts with correlate(), which acts like the base correlation function cor(). It differs by defaulting to pairwise deletion, and returning a correlation data frame (cor_df) of the following structure:

  • A tbl with an additional class, cor_df
  • An extra “term” column
  • Standardized variances (the matrix diagonal) set to missing values (NA) so they can be ignored.
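
As a minimal sketch of the points above, the following compares base cor() with correlate() on a small made-up data frame (the data frame d and its values are illustrative assumptions, not from the package documentation). It assumes corrr 0.4.3 or later, where the first column is named "term".

```r
library(corrr)

# Hypothetical toy data with one missing value in column `a`
d <- data.frame(
  a = c(1, 2, NA, 4, 5),
  b = c(2, 1, 4, 3, 5),
  c = c(5, 3, 4, 1, 2)
)

# Base cor() defaults to use = "everything", so pairs involving `a` are NA
m <- cor(d)
is.na(m["a", "b"])

# correlate() defaults to pairwise deletion and returns a cor_df
x <- correlate(d, quiet = TRUE)
inherits(x, "cor_df")

# The first column holds the variable names
names(x)[1]

# The diagonal is set to NA
is.na(x$a[x$term == "a"])

# Off-diagonal values match cor() with pairwise deletion
r_ab <- x$b[x$term == "a"]
isTRUE(all.equal(r_ab, cor(d$a, d$b, use = "pairwise.complete.obs")))
```

Passing quiet = TRUE only suppresses the message about the method and missing-value handling; the result is unchanged.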

API

The corrr API is designed with data pipelines in mind (e.g., to use %>% from the magrittr package). After correlate(), the primary corrr functions take a cor_df as their first argument, and return a cor_df or tbl (or output like a plot). These functions serve one of three purposes:

Internal changes (cor_df out):

  • shave() the upper or lower triangle (set to NA).
  • rearrange() the columns and rows based on correlation strengths.

Reshape structure (tbl or cor_df out):

  • focus() on select columns and rows.
  • stretch() into a long format.

Output/visualizations (console/plot out):

  • fashion() the correlations for pretty printing.
  • rplot() the correlations with shapes in place of the values.
  • network_plot() the correlations in a network.
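
The functions listed above can be tried on a small example. The sketch below uses three columns of the built-in mtcars data (the choice of columns is arbitrary) and shows shave(), stretch(), and focus(); it is an illustration under those assumptions rather than a complete tour of the API.

```r
library(corrr)

# Correlate three columns of mtcars (cor_df with 3 rows)
x <- correlate(mtcars[, c("mpg", "hp", "wt")], quiet = TRUE)

# Internal change: shave() sets the upper triangle to NA (cor_df out)
lower <- shave(x)

# Reshape: stretch() returns a long tbl with columns x, y and r;
# na.rm = TRUE drops the NA cells left by shave() and the diagonal
long <- stretch(lower, na.rm = TRUE)
long

# Reshape: focus() keeps `mpg` as a column and removes it from the rows
mpg_only <- focus(x, mpg)
mpg_only
```

Because shave() leaves one copy of each pair, stretching the shaved cor_df yields one row per unique pair of variables.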

Databases and Spark

The correlate() function also works with database tables. It automatically pushes the correlation calculations to the database, collects the results in R, and returns a cor_df object. This allows the results to integrate with the rest of the corrr API.

Examples

library(MASS)
library(corrr)
set.seed(1)

# Simulate three columns correlating about .7 with each other
mu <- rep(0, 3)
Sigma <- matrix(.7, nrow = 3, ncol = 3) + diag(3)*.3
seven <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)

# Simulate three columns correlating about .4 with each other
mu <- rep(0, 3)
Sigma <- matrix(.4, nrow = 3, ncol = 3) + diag(3)*.6
four <- mvrnorm(n = 1000, mu = mu, Sigma = Sigma)

# Bind together
d <- cbind(seven, four)
colnames(d) <- paste0("v", 1:ncol(d))

# Insert some missing values
d[sample(1:nrow(d), 100, replace = TRUE), 1] <- NA
d[sample(1:nrow(d), 200, replace = TRUE), 5] <- NA

# Correlate
x <- correlate(d)
class(x)
#> [1] "cor_df"     "tbl_df"     "tbl"        "data.frame"
x
#> # A tibble: 6 × 7
#>   term        v1       v2       v3       v4       v5      v6
#>   <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>
#> 1 v1    NA        0.684    0.716    0.00187 -0.00769 -0.0237
#> 2 v2     0.684   NA        0.702   -0.0248   0.00495 -0.0161
#> 3 v3     0.716    0.702   NA       -0.00171  0.0205  -0.0566
#> 4 v4     0.00187 -0.0248  -0.00171 NA        0.452    0.442 
#> 5 v5    -0.00769  0.00495  0.0205   0.452   NA        0.424 
#> 6 v6    -0.0237  -0.0161  -0.0566   0.442    0.424   NA

NOTE: Prior to corrr 0.4.3, the first column of a cor_df data frame was named “rowname”. As of corrr 0.4.3, this column is named “term”.

As a tbl, we can use functions from data frame packages like dplyr, tidyr, ggplot2:

library(dplyr)

# Filter rows by correlation size
x %>% filter(v1 > .6)
#> # A tibble: 2 × 7
#>   term     v1     v2     v3       v4      v5      v6
#>   <chr> <dbl>  <dbl>  <dbl>    <dbl>   <dbl>   <dbl>
#> 1 v2    0.684 NA      0.702 -0.0248  0.00495 -0.0161
#> 2 v3    0.716  0.702 NA     -0.00171 0.0205  -0.0566

corrr functions work in pipelines (cor_df in; cor_df or tbl out):

x <- datasets::mtcars %>%
       correlate() %>%    # Create correlation data frame (cor_df)
       focus(-cyl, -vs, mirror = TRUE) %>%  # Focus on cor_df without 'cyl' and 'vs'
       rearrange() %>%  # rearrange by correlations
       shave() # Shave off the upper triangle for a clean result
#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'
       
fashion(x)
#>   term  mpg drat   am gear qsec carb   hp   wt disp
#> 1  mpg                                             
#> 2 drat  .68                                        
#> 3   am  .60  .71                                   
#> 4 gear  .48  .70  .79                              
#> 5 qsec  .42  .09 -.23 -.21                         
#> 6 carb -.55 -.09  .06  .27 -.66                    
#> 7   hp -.78 -.45 -.24 -.13 -.71  .75               
#> 8   wt -.87 -.71 -.69 -.58 -.17  .43  .66          
#> 9 disp -.85 -.71 -.59 -.56 -.43  .39  .79  .89
rplot(x)

datasets::airquality %>% 
  correlate() %>% 
  network_plot(min_cor = .2)
#> Correlation computed with
#> • Method: 'pearson'
#> • Missing treated using: 'pairwise.complete.obs'

Contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

More Repositories

1

broom

Convert statistical analysis objects from R into tidy format
R
1,402
star
2

tidymodels

Easily install and load the tidymodels packages
R
727
star
3

infer

An R package for tidyverse-friendly statistical inference
R
702
star
4

parsnip

A tidy unified interface to models
R
554
star
5

TMwR

Code and content for "Tidy Modeling with R"
RMarkdown
552
star
6

recipes

Pipeable steps for feature engineering and data preprocessing to prepare for modeling
R
534
star
7

yardstick

Tidy methods for measuring model performance
R
354
star
8

rsample

Classes and functions to create and summarize resampling objects
R
318
star
9

stacks

An R package for tidy stacked ensemble modeling
R
284
star
10

tidypredict

Run predictions inside the database
R
256
star
11

tune

Tools for tidy parameter tuning
R
248
star
12

workflows

Modeling Workflows
R
193
star
13

textrecipes

Extra recipes for Text Processing
R
154
star
14

embed

Extra recipes for predictor embeddings
R
140
star
15

themis

Extra recipes steps for dealing with unbalanced data
R
138
star
16

butcher

Reduce the size of model objects saved to disk
R
130
star
17

censored

Parsnip wrappers for survival models
R
123
star
18

dials

Tools for creating tuning parameter values
R
110
star
19

probably

Tools for post-processing class probability estimates
R
108
star
20

tidyclust

A tidy unified interface to clustering models
R
103
star
21

tidyposterior

Bayesian comparisons of models using resampled statistics
R
101
star
22

tidymodels.org-legacy

Legacy Source of tidymodels.org
HTML
100
star
23

aml-training

The most recent version of the Applied Machine Learning notes
HTML
100
star
24

hardhat

Construct Modeling Packages
R
99
star
25

workflowsets

Create a collection of modeling workflows
R
88
star
26

usemodels

Boilerplate Code for tidymodels
R
85
star
27

modeldb

Run models inside a database using R
R
79
star
28

workshops

Website and materials for tidymodels workshops
JavaScript
76
star
29

multilevelmod

Parsnip wrappers for mixed-level and hierarchical models
R
72
star
30

spatialsample

Create and summarize spatial resampling objects 🗺
R
69
star
31

learntidymodels

Learn tidymodels with interactive learnr primers
R
64
star
32

brulee

High-Level Modeling Functions with 'torch'
R
62
star
33

finetune

Additional functions for model tuning
R
61
star
34

shinymodels

R
45
star
35

applicable

Quantify extrapolation of new samples given a training set
R
43
star
36

model-implementation-principles

recommendations for creating R modeling packages
HTML
40
star
37

bonsai

parsnip wrappers for tree-based models
R
40
star
38

rules

parsnip extension for rule-based models
R
39
star
39

planning

Documents to plan and discuss future development
36
star
40

discrim

Wrappers for discriminant analysis and naive Bayes models for use with the parsnip package
R
28
star
41

baguette

parsnip Model Functions for Bagging
R
23
star
42

modeldata

Data Sets Used by tidymodels Packages
R
22
star
43

poissonreg

parsnip wrappers for Poisson regression
R
22
star
44

agua

Create and evaluate models using 'tidymodels' and 'h2o'
R
21
star
45

extratests

Integration and other testing for tidymodels
R
20
star
46

tidymodels.org

Source of tidymodels.org
JavaScript
16
star
47

plsmod

Model Wrappers for Projection Methods
R
14
star
48

cloudstart

RStudio Cloud ☁️ resources to accompany tidymodels.org
12
star
49

desirability2

Desirability Functions for Multiparameter Optimization
R
7
star
50

modeldatatoo

More Data Sets Useful for Modeling Examples
R
5
star
51

.github

GitHub contributing guidelines for tidymodels packages
4
star
52

modelenv

Provide Tools to Register Models for use in Tidymodels
R
3
star
53

survivalauc

What the Package Does (One Line, Title Case)
C
2
star