How To Make Your Data Analysis Notebooks More Reproducible
Slide deck | Slide deck as PDF
Resources
I have included a handful of links to papers, software packages, and tutorials/manuals about some of the tools I mention in my talk. Pull requests or issues suggesting additional ones are welcome.
Research Compendia
- Statistical Analyses and Reproducible Research
- Packaging Data Analytical Work Reproducibly Using R (and Friends) (OA preprint). A practical introduction to setting up a research compendium in R.
- The rOpenSci reproducibility guide. Slightly dated but still very useful.
Examples of Research Compendia on GitHub
Below are a few links to real-world examples of research compendia in R. To have a minimal compendium, all you really need is a valid DESCRIPTION file containing a handful of fields such as type, name, version and dependencies. See Marwick et al. 2017 for a detailed description of the different types of compendia.
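As a rough illustration of the fields mentioned above, a minimal DESCRIPTION file might look something like the sketch below. The compendium name, title, version number, R version, and listed packages are all placeholders made up for this example, and the Type value is only illustrative; none of them are taken from the compendia linked here.

```text
Package: mycompendium
Type: Compendium
Title: Placeholder Title for a Hypothetical Compendium
Version: 0.0.1
Depends:
    R (>= 3.5.0)
Imports:
    dplyr,
    ggplot2
```

This is enough to record what the project is called and what it depends on. To build and check the compendium as a full R package, the file would also need the other fields R treats as mandatory, such as Description, Author, Maintainer, and License.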
Small
Medium
Large
- Non-parametric Bayesian Inference for Conservation Decisions
Find various other compendia on GitHub and Zenodo using the research-compendium tag.
Software packages related to research compendia
- 📦 rrtools by Ben Marwick (also the author of the packaging data analysis paper mentioned above) extends functions in devtools and provides instructions, templates, and functions to make a basic compendium suitable for doing reproducible research with R.
- 📦 usethis: Many of the major functions in rrtools are imported from usethis. A savvy user can get by setting up and maintaining a compendium purely with usethis functions.
- 📦 goodpractice: Designed to help you build more robust packages, this package does a deep dive on your package contents, provides advice on syntax pitfalls to avoid, offers code formatting suggestions, and helps you improve overall package structure.
- The 📦 rticles package by JJ has numerous journal templates and works well together with RStudio addins like wordcountaddin and citr + knitcitations.
Data management
- 📦 piggyback, [docs]: This clever R package allows you to attach arbitrary data (or other) files (up to 2 GB each) to a GitHub release. Given GitHub's fast CDN, this is an easy way to attach large files to a compendium and read them back in a local, collaborator, or remote environment. As always, be sure to archive a long-term copy on Zenodo.
- 📦 arkdb [docs]: This package allows you to archive and unarchive databases as flat text files.
- 🎥 For more on setting up data packages, see this excellent talk by Noam Ross at New York R.
Computational environments: Binder and friends
- My Binder is a free binderhub deployment that turns any Git repo into a collection of interactive notebooks. Now with better R support!
- For instructions on how to set this up for your R project, see my notes here
- Introducing Binder 2.0 - share your interactive research environment. Paper describing the architecture of Binder, in case you are interested in what is happening under the hood.
- 🎥 A talk about Binder at SciPy 2018. Also see conference proceedings PDF.
- repo2docker: A Python module that will turn any repo (or local folder) into a Docker image.
Other hosted Binder hubs
- Pangeo binder Pangeo encourages everyone to use it.
- gesis
- Syzygy: Binder + JupyterHub for Compute Canada
Setting up Binder for your analysis
I have captured all the various ways to set up mybinder with an R project in a separate document.
Are you interested in setting up or hosting a BinderHub for the R community? Get in touch via the issues.
Also see
- Whole Tale
- Computing environments for reproducibility: Capturing the "Whole Tale" - OA paper describing the Whole Tale project.
- Code Ocean - A commercial, black-box, full-stack service that will accomplish something similar to the above two projects. Code Ocean links will likely start appearing in papers soon.
Software packages related to setting up computational environments
- 📦 containerit. Detailed blog post. This sweet package will generate a Dockerfile for you by examining the code inside a folder or just from your session info. This is analogous to repo2docker but is very R-centric.
- stevedore: Although there are a few Docker clients (docker, harbor), this is my recommendation for managing Docker containers from inside R.
Workflows: drake and friends
📦 drake
- An R-focused pipeline toolkit for reproducibility and high-performance computing. Install the package from here or CRAN.
- The prequel to the drake R package. A blog post by the creator of drake describing his motivation for the package.
- drake manual. A detailed bookdown guide on how to set up and use drake for projects of varying levels of complexity.
- Presentation on drake. Slides from a talk by Will Landau (who is here at the conference, so go pick his brain if you want to learn more!).
Real-world drake examples
Miscellaneous
- IKEA diagram inspired by IDEA instructions
Acknowledgments
Many thanks to Chris Holdgraf, Carl Boettiger, Will Landau, and Ben Marwick for various discussions on these topics. Also thanks to Ciera Martinez, Kara Woo, and Nick Tierney for comments on the presentation.