An R package for time series models and forecasts with xgboost compatible with {forecast} S3 classes

forecastxgb-r-package

The forecastxgb package provides time series modelling and forecasting functions that combine the machine learning approach of Chen, He and Benesty's xgboost with the convenient handling of time series and familiar API of Rob Hyndman's forecast. It applies to time series the Extreme Gradient Boosting proposed in Greedy Function Approximation: A Gradient Boosting Machine, by Jerome Friedman in 2001. xgboost has become an important machine learning algorithm, nicely explained in this accessible documentation.

Warning: this package is under active development and is some way off a CRAN release (realistically, not until some time in 2017). Currently the forecasting results with the default settings are, frankly, pretty rubbish, but there is hope I can get better settings. The API and default values of arguments should be expected to continue to change.

Installation

Only on GitHub for now, but a CRAN release is planned for November 2016. Comments and suggestions welcomed.

devtools::install_github("ellisp/forecastxgb-r-package/pkg")

This implementation uses as explanatory features:

  • lagged values of the response variable
  • dummy variables for seasons
  • current and lagged values of any external regressors supplied as xreg

Usage

Basic usage

The workhorse function is xgbar. This fits a model to a time series. Under the hood, it creates a matrix of explanatory variables based on lagged versions of the response time series, and (optionally) dummy variables of some sort for seasons. That matrix is then fed as the feature set for xgboost to do its stuff.
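As a rough illustration of the feature matrix described above, the sketch below builds lagged columns and seasonal dummies in base R. It is not the package's code: make_lag_features is a hypothetical helper, the choice of maxlag = 24 and of dropping the first season as the reference level are assumptions, and AirPassengers (from base R's datasets) stands in for any seasonal series.

```r
# Hypothetical helper, not part of forecastxgb: build lag and season features.
# Assumes y is a ts object with frequency(y) > 1.
make_lag_features <- function(y, maxlag) {
  n <- length(y)
  # embed() gives rows (y[t], y[t-1], ..., y[t-maxlag]) for t = maxlag + 1, ..., n
  e <- embed(as.numeric(y), maxlag + 1)
  response <- e[, 1]
  lags <- e[, -1, drop = FALSE]
  colnames(lags) <- paste0("lag", seq_len(maxlag))

  # Season of each retained observation; season 1 is the reference level (assumption)
  season <- cycle(y)[(maxlag + 1):n]
  f <- frequency(y)
  dummies <- sapply(2:f, function(s) as.numeric(season == s))
  colnames(dummies) <- paste0("season", 2:f)

  list(y = response, x = cbind(lags, dummies))
}

feats <- make_lag_features(AirPassengers, maxlag = 24)
dim(feats$x)  # 120 rows (144 - 24) and 35 columns (24 lags + 11 season dummies)
```

With 24 lags and monthly data this gives 35 features and loses the first 24 observations, which is consistent with the counts in the summary output for gas shown later (35 features, 476 original and 452 effective observations).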

Univariate

Usage with default values is straightforward. Here it is fit to Australian monthly gas production 1956-1995, an example dataset provided in forecast:

library(forecastxgb)
model <- xgbar(gas)

(Note: the "Stopping. Best iteration..." message printed to the screen is produced by xgboost::xgb.cv, which uses cat() rather than message() to print information on its processing.)

By default, xgbar uses row-wise cross-validation to determine the best number of rounds of iterations for the boosting algorithm without overfitting. A final model is then fit on the full available dataset. The relative importance of the various features in the model can be inspected by importance_xgb() or, more conveniently, the summary method for objects of class xgbar.

summary(model)

Importance of features in the xgboost model:
    Feature         Gain        Cover   Frequency
 1:   lag12 5.097936e-01 0.1480752533 0.078475336
 2:   lag11 2.796867e-01 0.0731403763 0.042600897
 3:   lag13 1.043604e-01 0.0355137482 0.031390135
 4:   lag24 7.807860e-02 0.1320115774 0.069506726
 5:    lag1 1.579312e-02 0.1885383502 0.181614350
 6:   lag23 5.616290e-03 0.0471490593 0.042600897
 7:    lag9 2.510372e-03 0.0459623734 0.040358744
 8:    lag2 6.759874e-04 0.0436179450 0.053811659
 9:   lag14 5.874155e-04 0.0311432706 0.026905830
10:   lag10 5.467606e-04 0.0530535456 0.053811659
11:    lag6 3.820611e-04 0.0152243126 0.033632287
12:    lag4 2.188107e-04 0.0098697540 0.035874439
13:   lag22 2.162973e-04 0.0103617945 0.017937220
14:   lag16 2.042320e-04 0.0098118669 0.013452915
15:   lag21 1.962725e-04 0.0149638205 0.026905830
16:   lag18 1.810734e-04 0.0243994211 0.029147982
17:    lag3 1.709305e-04 0.0132850941 0.035874439
18:    lag5 1.439827e-04 0.0231837916 0.033632287
19:   lag15 1.313859e-04 0.0143560058 0.031390135
20:   lag17 1.239889e-04 0.0109696093 0.017937220
21: season7 1.049934e-04 0.0081041968 0.015695067
22:    lag8 9.773024e-05 0.0123299566 0.026905830
23:   lag19 7.733822e-05 0.0112879884 0.015695067
24:   lag20 5.425515e-05 0.0072648336 0.011210762
25:    lag7 3.772907e-05 0.0105354559 0.020179372
26: season4 4.067607e-06 0.0010709117 0.002242152
27: season5 2.863805e-06 0.0022286541 0.006726457
28: season6 2.628821e-06 0.0021707670 0.002242152
29: season9 9.226827e-08 0.0003762663 0.002242152
    Feature         Gain        Cover   Frequency

 35 features considered.
476 original observations.
452 effective observations after creating lagged features.

We see in the case of the gas data that the most important feature in explaining gas production is the production 12 months previously; and then other features decrease in importance from there but still have an impact.

Forecasting is the main purpose of this package, and a forecast method is supplied. The resulting objects are of class forecast and familiar generic functions work with them.

fc <- forecast(model, h = 12)
plot(fc)

[Figure: plot of the 12-period forecast fc]

Note that prediction intervals are not currently available.

See the vignette for more extended examples.

With external regressors

External regressors can be added by using the xreg argument familiar from other forecast functions like auto.arima and nnetar. xreg can be a vector or ts object but is easiest to integrate into the analysis if it is a matrix (even a matrix with one column) with well-chosen column names; that way feature names persist meaningfully.

The example below, with data taken from the fpp package supporting Hyndman and Athanasopoulos's Forecasting: Principles and Practice book, shows income being used to explain consumption. In the same way that the response variable y is expanded into lagged versions of itself, each column in xreg is expanded into lagged versions, which are then treated as individual features for xgboost.

library(fpp)
consumption <- usconsumption[ ,1]
income <- matrix(usconsumption[ ,2], dimnames = list(NULL, "Income"))
consumption_model <- xgbar(y = consumption, xreg = income)
summary(consumption_model)

Importance of features in the xgboost model:
        Feature        Gain       Cover   Frequency
 1:        lag2 0.253763903 0.082908446 0.124513619
 2:        lag1 0.219332682 0.114608734 0.171206226
 3: Income_lag0 0.115604367 0.183107958 0.085603113
 4:        lag3 0.064652150 0.093105742 0.089494163
 5:        lag8 0.055645114 0.099756152 0.066147860
 6: Income_lag8 0.050460959 0.049434715 0.050583658
 7: Income_lag1 0.047187235 0.088561295 0.050583658
 8: Income_lag6 0.040512834 0.029150964 0.050583658
 9:        lag6 0.031876878 0.044225227 0.054474708
10: Income_lag2 0.020355402 0.015739304 0.031128405
11: Income_lag5 0.018011250 0.036577256 0.035019455
12:        lag5 0.017930780 0.032143649 0.035019455
13:        lag7 0.016674036 0.034249612 0.027237354
14: Income_lag4 0.015952784 0.025714919 0.038910506
15: Income_lag7 0.009850701 0.021724673 0.019455253
16:        lag4 0.008819146 0.028929284 0.038910506
17: Income_lag3 0.008720737 0.013855021 0.019455253
18:     season4 0.003152234 0.001551762 0.003891051
19:     season3 0.001496807 0.004655287 0.007782101

 20 features considered.
164 original observations.
156 effective observations after creating lagged features.

We see that the two most important features explaining consumption are the two previous quarters' values of consumption; followed by the income in this quarter; and so on.
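The expansion of each xreg column into lagged versions, as described above, can be sketched in base R. This is an illustration under assumptions rather than the package's implementation: lag_xreg is a hypothetical helper, the simulated income matrix stands in for the real data, and maxlag = 8 is inferred from the feature names in the summary output (Income_lag0 to Income_lag8).

```r
# Hypothetical helper, not part of forecastxgb: expand each named column of a
# matrix into current (lag 0) and lagged versions.
lag_xreg <- function(xreg, maxlag) {
  stopifnot(is.matrix(xreg), !is.null(colnames(xreg)))
  out <- lapply(colnames(xreg), function(nm) {
    # Columns of embed(): lag 0, lag 1, ..., lag maxlag
    e <- embed(xreg[, nm], maxlag + 1)
    colnames(e) <- paste0(nm, "_lag", 0:maxlag)
    e
  })
  do.call(cbind, out)
}

set.seed(1)
# Simulated stand-in for the income series (164 quarterly observations)
income <- matrix(rnorm(164), dimnames = list(NULL, "Income"))
x <- lag_xreg(income, maxlag = 8)
dim(x)       # 156 rows (164 - 8) and 9 columns
colnames(x)  # "Income_lag0" ... "Income_lag8"
```

Giving the matrix a meaningful column name is what makes feature names such as Income_lag0 readable in the importance table.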

The challenge of using external regressors in a forecasting environment is that to forecast, you need values of the future external regressors. One way this is sometimes done is by first forecasting the individual regressors. In the example below we do this, making sure the data structure is the same as the original xreg. When the new value of xreg is given to forecast, it forecasts forward the number of rows of the new xreg.

income_future <- matrix(forecast(xgbar(usconsumption[,2]), h = 10)$mean, 
                        dimnames = list(NULL, "Income"))
plot(forecast(consumption_model, xreg = income_future))

[Figure: plot of the consumption forecast using income_future as xreg]

Options

Seasonality

Currently there are three methods of treating seasonality.

  • The current default method is to throw dummy variables for each season into the mix of features for xgboost to work with.
  • An alternative is to perform classic multiplicative seasonal adjustment on the series before feeding it to xgboost. This seems to work better.
  • A third option is to create a set of pairs of Fourier transform variables and use them as x regressors.

[Figures: three forecast plots, one per seasonality method. No h was provided, so each forecasts forward 24 periods.]

All methods perform quite poorly at the moment, suffering from the difficulty the default settings have in dealing with non-stationary data (see below).
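The second and third seasonality options can be sketched in base R. This is only an illustration: decompose() is used here as one classical multiplicative decomposition (the package's exact adjustment method is not stated here), fourier_pairs is a hypothetical helper, and K = 2 is an arbitrary choice. forecast::fourier() offers a fuller implementation of the Fourier terms.

```r
# Option 2 (sketch): classical multiplicative seasonal adjustment.
dec <- decompose(AirPassengers, type = "multiplicative")
adjusted <- AirPassengers / dec$seasonal  # divide out the seasonal factors

# Option 3 (sketch): K pairs of sine and cosine terms with period frequency(y).
# Hypothetical helper, not part of forecastxgb.
fourier_pairs <- function(y, K) {
  m <- frequency(y)
  stopifnot(K >= 1, K < m / 2)
  tt <- seq_along(y)
  cols <- lapply(seq_len(K), function(k) {
    cbind(sin(2 * pi * k * tt / m), cos(2 * pi * k * tt / m))
  })
  x <- do.call(cbind, cols)
  colnames(x) <- paste0(rep(c("S", "C"), K), rep(seq_len(K), each = 2))
  x
}

fx <- fourier_pairs(AirPassengers, K = 2)
colnames(fx)  # "S1" "C1" "S2" "C2"
```

The Fourier columns repeat every 12 rows for monthly data, so a handful of smooth columns can stand in for the 11 seasonal dummies.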

Transformations

The data can be transformed by a modulus power transformation (as per John and Draper, 1980) before feeding to xgboost. This transformation is similar to a Box-Cox transformation, but works with negative data. Leaving the lambda parameter as 1 will effectively switch off this transformation.

[Figures: two forecast plots. No h was provided, so each forecasts forward 24 periods.]

Version 0.0.9 of forecastxgb gave lambda the default value of BoxCox.lambda(abs(y)). This returned spectacularly bad forecasting results. Forcing this to be between 0 and 1 helped a little, but still gave very bad results. So far there isn't evidence (but neither is there enough investigation) that a Box-Cox transformation helps xgbar do its model fitting at all.
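For reference, the modulus power transformation mentioned above can be written in a few lines of base R. The function names are hypothetical and this is a sketch of the standard John and Draper form, not code taken from the package.

```r
# Hypothetical helpers illustrating the modulus power transformation.
# For lambda != 0: sign(y) * ((|y| + 1)^lambda - 1) / lambda
# For lambda == 0: sign(y) * log(|y| + 1)
modulus_transform <- function(y, lambda) {
  if (lambda == 0) {
    sign(y) * log(abs(y) + 1)
  } else {
    sign(y) * ((abs(y) + 1)^lambda - 1) / lambda
  }
}

# Inverse of the transformation, needed to bring forecasts back to the
# original scale.
inverse_modulus <- function(z, lambda) {
  if (lambda == 0) {
    sign(z) * (exp(abs(z)) - 1)
  } else {
    sign(z) * ((abs(z) * lambda + 1)^(1 / lambda) - 1)
  }
}

y <- c(-10, -1, 0, 2, 50)
modulus_transform(y, lambda = 1)    # unchanged: lambda = 1 switches it off
modulus_transform(y, lambda = 0.5)  # compresses large values, keeps the sign
```

Because the sign is handled separately from the magnitude, negative values pass through without error, which is the advantage over a plain Box-Cox transformation noted above.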

Non-stationarity

From experiments so far, it seems the basic idea of xgboost struggles in this context with extrapolation into a new range of values not seen in the training set. This suggests better results might be obtained by transforming the series into a stationary one before modelling - a similar approach to that taken by forecast::auto.arima. This option is available via trend_method = "differencing" and seems to perform well - certainly better than without - and it will probably be made a default setting once more experience is available.

model <- xgbar(AirPassengers, trend_method = "differencing", seas_method = "fourier")
plot(forecast(model, 24))

[Figure: plot of the 24-period AirPassengers forecast]
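Why differencing helps can be seen without xgboost at all. Tree-based learners generally cannot predict far outside the range of responses they were trained on, but predicted differences are cumulated back onto the last observed level, so the forecast of the original series can still move into new territory. The sketch below uses the mean historical difference as a stand-in predictor; that stand-in is purely an assumption for illustration, where xgbar would instead predict each difference with xgboost.

```r
y <- as.numeric(AirPassengers)
d <- diff(y)  # first differences, length(y) - 1 values

# Differencing is reversible: the first value plus cumulated differences
# recovers the original series.
rebuilt <- y[1] + c(0, cumsum(d))

# Stand-in "model" (assumption): every future difference equals the mean
# historical difference.
h <- 24
d_future <- rep(mean(d), h)

# Undo the differencing for the forecasts, starting from the last observation.
y_future <- tail(y, 1) + cumsum(d_future)

range(d_future)      # stays inside the range of the training differences
max(y_future) > max(y)  # yet the level forecast exceeds anything observed
```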

Future developments

Future work might include:

  • additional automated time-dependent features (e.g. dummy variables for trading days, Easter, etc.)
  • ability to include xreg values that don't get lagged
  • some kind of automated multiple variable forecasting, similar to a vector-autoregression
  • better choices of defaults for values such as lambda (for power transformations), K (for Fourier transforms) and, most likely to be effective, maxlag
