• Stars
    star
    398
  • Rank 103,519 (Top 3 %)
  • Language
    R
  • License
    GNU General Publi...
  • Created almost 8 years ago
  • Updated 5 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Helper functions for modelling

modelr

Lifecycle: superseded R-CMD-check Codecov test coverage

Overview

The modelr package provides functions that help you create elegant pipelines when modelling. It was designed primarily to support teaching the basics of modelling for the 1st edition of R for Data Science.

We no longer recommend it and instead suggest https://www.tidymodels.org/ for a more comprehensive framework for modelling within the tidyverse.

Installation

# The easiest way to get modelr is to install the whole tidyverse:
install.packages("tidyverse")

# Alternatively, install just modelr:
install.packages("modelr")

Getting started

library(modelr)

Partitioning and sampling

The resample class stores a β€œreference” to the original dataset and a vector of row indices. A resample can be turned into a dataframe by calling as.data.frame(). The indices can be extracted using as.integer():

# a subsample of the first ten rows in the data frame
rs <- resample(mtcars, 1:10)
as.data.frame(rs)
#>                    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
#> Mazda RX4         21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
#> Mazda RX4 Wag     21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
#> Datsun 710        22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
#> Hornet 4 Drive    21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
#> Hornet Sportabout 18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
#> Valiant           18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
#> Duster 360        14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
#> Merc 240D         24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
#> Merc 230          22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
#> Merc 280          19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
as.integer(rs)
#>  [1]  1  2  3  4  5  6  7  8  9 10

The class can be utilized in generating an exclusive partitioning of a data frame:

# generate a 30% testing partition and a 70% training partition
ex <- resample_partition(mtcars, c(test = 0.3, train = 0.7))
lapply(ex, dim)
#> $test
#> [1]  9 11
#> 
#> $train
#> [1] 23 11

modelr offers several resampling methods that result in a list of resample objects (organized in a data frame):

# bootstrap
boot <- bootstrap(mtcars, 100)
# k-fold cross-validation
cv1 <- crossv_kfold(mtcars, 5)
# Monte Carlo cross-validation
cv2 <- crossv_mc(mtcars, 100)

dim(boot$strap[[1]])
#> [1] 32 11
dim(cv1$train[[1]])
#> [1] 25 11
dim(cv1$test[[1]])
#> [1]  7 11
dim(cv2$train[[1]])
#> [1] 25 11
dim(cv2$test[[1]])
#> [1]  7 11

Model quality metrics

modelr includes several often-used model quality metrics:

mod <- lm(mpg ~ wt, data = mtcars)
rmse(mod, mtcars)
#> [1] 2.949163
rsquare(mod, mtcars)
#> [1] 0.7528328
mae(mod, mtcars)
#> [1] 2.340642
qae(mod, mtcars)
#>        5%       25%       50%       75%       95% 
#> 0.1784985 1.0005640 2.0946199 3.2696108 6.1794815

Interacting with models

A set of functions let you seamlessly add predictions and residuals as additional columns to an existing data frame:

set.seed(1014)
df <- tibble::tibble(
  x = sort(runif(100)),
  y = 5 * x + 0.5 * x ^ 2 + 3 + rnorm(length(x))
)

mod <- lm(y ~ x, data = df)
df %>% add_predictions(mod)
#> # A tibble: 100 Γ— 3
#>          x     y  pred
#>      <dbl> <dbl> <dbl>
#>  1 0.00740 3.90   3.08
#>  2 0.0201  2.86   3.15
#>  3 0.0280  2.93   3.19
#>  4 0.0281  3.16   3.19
#>  5 0.0312  3.19   3.21
#>  6 0.0342  3.72   3.23
#>  7 0.0514  0.984  3.32
#>  8 0.0586  5.98   3.36
#>  9 0.0637  2.96   3.39
#> 10 0.0652  3.54   3.40
#> # β„Ή 90 more rows
df %>% add_residuals(mod)
#> # A tibble: 100 Γ— 3
#>          x     y   resid
#>      <dbl> <dbl>   <dbl>
#>  1 0.00740 3.90   0.822 
#>  2 0.0201  2.86  -0.290 
#>  3 0.0280  2.93  -0.256 
#>  4 0.0281  3.16  -0.0312
#>  5 0.0312  3.19  -0.0223
#>  6 0.0342  3.72   0.496 
#>  7 0.0514  0.984 -2.34  
#>  8 0.0586  5.98   2.62  
#>  9 0.0637  2.96  -0.428 
#> 10 0.0652  3.54   0.146 
#> # β„Ή 90 more rows

For visualization purposes it is often useful to use an evenly spaced grid of points from the data:

data_grid(mtcars, wt = seq_range(wt, 10), cyl, vs)
#> # A tibble: 60 Γ— 3
#>       wt   cyl    vs
#>    <dbl> <dbl> <dbl>
#>  1  1.51     4     0
#>  2  1.51     4     1
#>  3  1.51     6     0
#>  4  1.51     6     1
#>  5  1.51     8     0
#>  6  1.51     8     1
#>  7  1.95     4     0
#>  8  1.95     4     1
#>  9  1.95     6     0
#> 10  1.95     6     1
#> # β„Ή 50 more rows

# For continuous variables, seq_range is useful
mtcars_mod <- lm(mpg ~ wt + cyl + vs, data = mtcars)
data_grid(mtcars, wt = seq_range(wt, 10), cyl, vs) %>% add_predictions(mtcars_mod)
#> # A tibble: 60 Γ— 4
#>       wt   cyl    vs  pred
#>    <dbl> <dbl> <dbl> <dbl>
#>  1  1.51     4     0  28.4
#>  2  1.51     4     1  28.9
#>  3  1.51     6     0  25.6
#>  4  1.51     6     1  26.2
#>  5  1.51     8     0  22.9
#>  6  1.51     8     1  23.4
#>  7  1.95     4     0  27.0
#>  8  1.95     4     1  27.5
#>  9  1.95     6     0  24.2
#> 10  1.95     6     1  24.8
#> # β„Ή 50 more rows

More Repositories

1

ggplot2

An implementation of the Grammar of Graphics in R
R
6,279
star
2

dplyr

dplyr: A grammar of data manipulation
R
4,627
star
3

tidyverse

Easily install and load packages from the tidyverse
R
1,575
star
4

rvest

Simple web scraping for R
R
1,455
star
5

tidyr

Tidy Messy Data
R
1,323
star
6

purrr

A functional programming toolkit for R
R
1,211
star
7

readr

Read flat files (csv, tsv, fwf) into R
R
985
star
8

magrittr

Improve the readability of R code with the pipe
R
952
star
9

datascience-box

Data Science Course in a Box
JavaScript
895
star
10

reprex

Render bits of R code for sharing, e.g., on GitHub or StackOverflow.
R
726
star
11

readxl

Read excel files (.xls and .xlsx) into R πŸ–‡
C++
713
star
12

lubridate

Make working with dates in R just that little bit easier
R
712
star
13

glue

Glue strings to data in R. Small, fast, dependency free interpreted string literals.
R
685
star
14

dtplyr

Data table backend for dplyr
R
656
star
15

tibble

A modern re-imagining of the data frame
R
641
star
16

multidplyr

A dplyr backend that partitions a data frame over multiple processes
R
636
star
17

vroom

Fast reading of delimited files
C++
604
star
18

stringr

A fresh approach to string manipulation in R
R
565
star
19

forcats

🐈🐈🐈🐈: tools for working with categorical variables (factors)
R
533
star
20

dbplyr

Database (DBI) backend for dplyr
R
455
star
21

haven

Read SPSS, Stata and SAS files from R
C
415
star
22

googlesheets4

Google Spreadsheets R API (reboot of the googlesheets package)
R
347
star
23

googledrive

Google Drive R API
R
312
star
24

style

The tidyverse style guide for R code
HTML
285
star
25

design

Tidyverse design principles
R
208
star
26

tidyverse.org

Source of tidyverse.org
HTML
186
star
27

hms

A simple class for storing time-of-day values
R
136
star
28

nycflights13

An R data package containing all out-bound flights from NYC in 2013 + useful metdata
R
121
star
29

tidyversedashboard

Tidyverse activity dashboard
R
71
star
30

tidy-dev-day

Tidyverse developer day
59
star
31

tidyeval

A guide to tidy evaluation
CSS
54
star
32

dsbox

Companion R package to Data Science Course in a Box
R
47
star
33

tidytemplate

A pkgdown template for core tidyverse packages
SCSS
45
star
34

blob

A simple S3 class for representing BLOBs
R
44
star
35

code-review

32
star
36

funs

Collection of low-level functions for working with vctrs
R
31
star
37

website-analytics

Web analytics for tidyverse + r-lib sites
R
28
star
38

tidyups

20
star
39

ggplot2-docs

ggplot2 documentation. Auto-generated from ggplot2 sources by pkgdown
HTML
10
star