An R package for tidy stacked ensemble modeling

stacks - tidy model stacking

stacks is an R package for model stacking that aligns with the tidymodels. Model stacking is an ensembling method that takes the outputs of many models and combines them to generate a new model—referred to as an ensemble in this package—that generates predictions informed by each of its members.

The process goes something like this:

  1. Define candidate ensemble members using functionality from rsample, parsnip, workflows, recipes, and tune
  2. Initialize a data_stack object with stacks()
  3. Iteratively add candidate ensemble members to the data_stack with add_candidates()
  4. Evaluate how to combine their predictions with blend_predictions()
  5. Fit candidate ensemble members with non-zero stacking coefficients with fit_members()
  6. Predict on new data with predict()
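The steps above can be sketched end to end as follows. This is a minimal illustration rather than the package's canonical example: the iris data, the two candidate definitions, and all object names (lin_wf, tree_wf, ens, and so on) are assumptions chosen for brevity, and it assumes R 4.1 or later for the native pipe.

```r
library(tidymodels)  # attaches rsample, parsnip, workflows, recipes, tune, ...
library(stacks)

set.seed(1)

# Toy regression problem (an assumption for illustration):
# predict Sepal.Length from the remaining iris columns.
split <- initial_split(iris, prop = 0.8)
train <- training(split)
test  <- testing(split)

# Step 1: define candidates. Every candidate must use the same resamples.
folds <- vfold_cv(train, v = 5)

lin_wf <- workflow() |>
  add_formula(Sepal.Length ~ .) |>
  add_model(linear_reg() |> set_engine("lm"))

tree_wf <- workflow() |>
  add_formula(Sepal.Length ~ .) |>
  add_model(
    decision_tree(cost_complexity = tune()) |>
      set_engine("rpart") |>
      set_mode("regression")
  )

# The control_stack_*() helpers ask tune to keep the out-of-sample
# predictions and the workflow, both of which stacks needs later.
lin_res <- fit_resamples(
  lin_wf, resamples = folds, control = control_stack_resamples()
)
tree_res <- tune_grid(
  tree_wf, resamples = folds, grid = 4, control = control_stack_grid()
)

# Steps 2-5: initialize, add candidates, blend, and fit the retained members.
ens <- stacks() |>
  add_candidates(lin_res) |>
  add_candidates(tree_res) |>
  blend_predictions() |>
  fit_members()

# Step 6: predict on new data.
preds <- predict(ens, new_data = test)
head(preds)
```

Which candidates survive blending depends on the data and the seed, so the set of fitted members may differ from run to run.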

You can install the package with the following code:

install.packages("stacks")

Install the development version with:

# install.packages("pak")
pak::pak("tidymodels/stacks")

stacks is generalized with respect to:

  • Model type: Any model type implemented in parsnip or extension packages is fair game to add to a stacks model stack. Here’s a table of many of the implemented model types in the tidymodels core, with a link there to an article about implementing your own model classes as well.
  • Cross-validation scheme: Any resampling algorithm implemented in rsample or extension packages is fair game for resampling data for use in training a model stack.
  • Error metric: Any metric function implemented in yardstick or extension packages is fair game for evaluating model stacks and their members. That package provides some infrastructure for creating your own metric functions as well!

stacks uses a regularized linear model to combine predictions from ensemble members, though this model type is only one of many possible learning algorithms that could be used to fit a stacked ensemble model. For implementations of additional ensemble learning algorithms, check out h2o and SuperLearner.

Rather than diving right into the implementation, we’ll focus here on how the pieces fit together, conceptually, in building an ensemble with stacks. See the basics vignette for an example of the API in action!

a grammar

At the highest level, ensembles are formed from model definitions. In this package, model definitions are an instance of a minimal workflow, containing a model specification (as defined in the parsnip package) and, optionally, a preprocessor (as defined in the recipes package). Model definitions specify the form of candidate ensemble members.

A diagram representing “model definitions,” which specify the form of candidate ensemble members. Three colored boxes represent three different model types; a K-nearest neighbors model (in salmon), a linear regression model (in yellow), and a support vector machine model (in green).

To be used in the same ensemble, each of these model definitions must share the same resample. This rsample rset object, when paired with the model definitions, can be used to generate the tuning/fitting results objects for the candidate ensemble members with tune.

A diagram representing “candidate members” generated from each model definition. Four salmon-colored boxes labeled “KNN” represent K-nearest neighbors models trained on the resamples with differing hyperparameters. Similarly, the linear regression model generates one candidate member, and the support vector machine model generates six.

Candidate members first come together in a data_stack object through the add_candidates() function. Principally, these objects are just tibbles, where the first column gives the true outcome in the assessment set (the portion of the training set used for model validation), and the remaining columns give the predictions from each candidate ensemble member. (When the outcome is numeric, there’s only one column per candidate ensemble member. Classification requires as many columns per candidate as there are levels in the outcome variable.) They also bring along a few extra attributes to keep track of model definitions.
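To make the shape of a data stack concrete, the sketch below builds one from two untuned candidates and prints it as a plain tibble. The data set, formula, and object names are assumptions for illustration only; with a numeric outcome and one model configuration per candidate, the result should have the outcome column followed by one prediction column per candidate.

```r
library(tidymodels)
library(stacks)

set.seed(2)

# Shared resamples and a control object that saves out-of-sample predictions.
folds <- vfold_cv(iris, v = 5)
ctrl  <- control_stack_resamples()

# Two hypothetical candidates, each with a single model configuration.
lin_res <- workflow() |>
  add_formula(Sepal.Length ~ .) |>
  add_model(linear_reg()) |>
  fit_resamples(resamples = folds, control = ctrl)

tree_res <- workflow() |>
  add_formula(Sepal.Length ~ .) |>
  add_model(decision_tree(mode = "regression")) |>
  fit_resamples(resamples = folds, control = ctrl)

# Collect the assessment-set predictions into a data stack.
data_st <- stacks() |>
  add_candidates(lin_res) |>
  add_candidates(tree_res)

# First column: true outcome; remaining columns: candidate predictions.
as_tibble(data_st)
```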

A diagram representing a “data stack,” a specific kind of data frame. Colored “columns” depict, in white, the true value of the outcome variable in the validation set, followed by four columns (in salmon) representing the predictions from the K-nearest neighbors model, one column (in tan) representing the linear regression model, and six (in green) representing the support vector machine model.

Then, the data stack can be evaluated using blend_predictions() to determine how best to combine the outputs from each of the candidate members. In the stacking literature, this process is commonly called metalearning.

The outputs of each member are likely highly correlated. Thus, depending on the degree of regularization you choose, the coefficients for the inputs of (possibly) many of the members will zero out—their predictions will have no influence on the final output, and those terms will thus be thrown out.
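The effect of regularization on correlated candidates can be illustrated outside of stacks. The sketch below is not the package's internal code: it simulates predictions from five hypothetical candidates and fits a lasso with non-negative coefficients using glmnet directly, as a stand-in for the regularized linear model described above. In stacks itself, the amount and type of regularization are set through the penalty and mixture arguments of blend_predictions().

```r
library(glmnet)

set.seed(3)
n <- 200
truth <- rnorm(n)

# Simulated predictions from five hypothetical candidates: each is the
# true outcome plus noise, so the columns are strongly correlated.
noise_sd <- c(0.2, 0.25, 0.3, 1, 1.5)
cand_preds <- sapply(noise_sd, function(s) truth + rnorm(n, sd = s))
colnames(cand_preds) <- paste0("cand_", seq_along(noise_sd))

# Lasso (alpha = 1) with coefficients constrained to be non-negative.
fit <- glmnet(cand_preds, truth, alpha = 1, lower.limits = 0)

# Count the candidates with a nonzero coefficient at a given penalty.
n_nonzero <- function(s) sum(as.matrix(coef(fit, s = s))[-1, 1] != 0)

# Compare a light penalty with a very heavy one.
n_nonzero(0.001)
n_nonzero(2)
```

With the heavy penalty every candidate is dropped, while the light penalty keeps at least one; intermediate values trade off between the two.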

A diagram representing “stacking coefficients,” the coefficients of the linear model combining each of the candidate member predictions to generate the ensemble’s ultimate prediction. Boxes for each of the candidate members are placed beside each other, filled in with color if the coefficient for the associated candidate member is nonzero.

These stacking coefficients determine which candidate ensemble members will become ensemble members. Candidates with non-zero stacking coefficients are then fitted on the whole training set, altogether making up a model_stack object.

A diagram representing the “model stack” class, which collates the stacking coefficients and members (candidate members with nonzero stacking coefficients that are trained on the full training set). The representation of the stacking coefficients is as before, where the members (shown next to their associated stacking coefficients) are colored-in pentagons. Model stacks are a list subclass.

This model stack object, returned by fit_members(), is ready to predict on new data! The trained ensemble members are often referred to as base models in the stacking literature.

The full visual outline for these steps can be found here. The API for the package closely mirrors these ideas. See the basics vignette for an example of how this grammar is implemented!

contributing

This project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

In the stacks package, some test objects take too long to build with every commit. If your contribution changes the structure of data_stack or model_stack objects, please regenerate these test objects by running the scripts in man-roxygen/example_models.Rmd, including those with chunk options eval = FALSE.
