
rrrpkg

Use of an R package to facilitate reproducible research

What is a research compendium?

We introduce the concept of a compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data,...), and as a means for distributing, managing and updating the collection. - Gentleman, R. and Temple Lang, D. (2004)

The goal of a research compendium is to provide a standard and easily recognisable way for organising a reproducible research project with R. A research compendium is ideal for projects that result in the publication of a paper because then readers of the paper can access the code and data that generated the results in the paper. A research compendium is a convention for how you organise your research artefacts into directories. The guiding principle in creating a research compendium is to organise your files following conventions that many people use. Following these conventions will help other people instantly familiarise themselves with the structure of your project, and also support tool building which takes advantage of the shared structure.

Some of the earliest examples of this approach can be found in Robert Gentleman and Duncan Temple Lang's 2004 paper "Statistical Analyses and Reproducible Research" in the Bioconductor Project Working Papers and Gentleman's 2005 article "Reproducible Research: A Bioinformatics Case Study" in Statistical Applications in Genetics and Molecular Biology. Since then there has been a substantial increase in the use of R as a research tool in many fields, and numerous improvements in the ease of making R packages. This means that making a research compendium based on an R package is now a practical solution to the challenges of organising and communicating research results for many scientists.

Why create a research compendium?

Using research compendia simplifies file management and streamlines analytical workflows, making your research more efficient. A compendium makes it easier to communicate your work with other researchers (and your future self) and to demonstrate the correctness of your results. This can lead to higher visibility of your work, credit for your code as well as the paper, a boost in citations, and makes it easier for others to build on your work.

How to make a research compendium

At its simplest, a research compendium might consist of a single file of R code with inline comments documenting the workflow. A slightly more complex approach might be an R markdown file with text and code in the same source document. In many cases these simple approaches will be ideal, and more elaborate organisation would add unnecessary complexity and points of failure. But many projects will require some additional organisation to make them easier to work with. An ideal organisation for a more complex project would look like this:

  • A README.md file that describes the overall project and where to get started. It's helpful to include a graphical summary of the interlocking pieces of the project.
  • Script files with reusable functions go in the R/ directory. This is often a small part of an analysis but it's important.
  • Raw data files live in the data/ directory. If your data are very large, it may be worthwhile to include a small sample dataset so that people can try out the techniques without having to run very expensive computations.
  • Analysis scripts and report files go in the analysis/ directory. The analysis/ directory could include either an R markdown file, a makefile or a makefile.R file that controls the order of the code. In many cases it will be useful to give the analysis scripts ascending names, e.g. 001-load.R, 002-clean.R etc. (but this only gives a linear ordering; it doesn't capture the full tree of dependencies in the way a makefile does).
  • A DESCRIPTION file that gives structured, machine- and human-readable information about the authors, licensing, the software dependencies and other metadata of the compendium. When a DESCRIPTION file is included along with the other items above, then the compendium is also a formal R package and you can take advantage of many time-saving tools for package development, testing and sharing (for example, the devtools package).
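As a rough illustration, a minimal DESCRIPTION file for a compendium might look something like the sketch below; the package name, author and dependencies are placeholders, not part of any particular convention:

Package: myproject
Title: Research Compendium for an Analysis of My Data
Version: 0.0.1
Authors@R: person("Jane", "Doe", email = "jane@example.org", role = c("aut", "cre"))
Description: Data, code and text needed to reproduce the analyses in the paper.
License: MIT + file LICENSE
Imports:
    ggplot2,
    dplyr

With a file like this in place, tools such as devtools::install_deps() can install the listed dependencies, and devtools::check() will treat the compendium as an ordinary R package.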

If you're familiar with R packages, you'll notice many similarities with these conventions. But there are some differences:

  • Stand-alone package vignettes are often used as the manuscript file, but may not be the best way to organise complex computation because they don't make the ordering explicit (or give the big picture). A separate analysis/ or manuscript/ directory is more helpful here, and many people use a makefile to organise file dependencies.
  • Documentation and testing tend to be less important for compendia than packages - they are still important, but they tend to come later in the process, and are used by relatively more advanced/experienced compendia users.

More complex research compendia include other package elements such as a licence, tests, continuous integration, and dependencies external to R, such as a dockerfile to replicate the computational environment that the analyses were originally conducted in.

How to share a research compendium

You should prepare your compendium using a version control system such as git. Then, when you are ready to share it, the best way is to archive a specific commit of your compendium at a repository that issues permanent URLs, such as figshare or Zenodo, which give DOIs for archived files. Then you can circulate the exact version of your compendium that generated the published results. This means you have a publicly available snapshot of the code that matches the paper. Code development can continue after the paper is published, but with a DOI that links to a specific commit, other users of the code can be confident that they have the version that matches the paper. A DOI also simplifies citation of the compendium, so you can cite it in your paper (and others can cite it in their work) using a persistent URL.

Putting your compendium on Dropbox or Google Drive is another way to make the compendium easily available.

Getting started with a research compendium

  • Start simple - it's OK to have just one R script or one R markdown file. But as your project gets more complex and you start to break it into multiple files, you should follow the simple conventions described above.
  • A simple example of a research compendium might look like this:
project
|- DESCRIPTION          # project metadata and dependencies 
|- README.md            # top-level description of content and guide to users
|
|- data/                # raw data, not changed once created
|  +- my_data.csv       # data files in open formats such as TXT, CSV, TSV, etc.
|
|- analysis/            # any programmatic code 
|  +- my_scripts.R      # R code used to analyse and visualise data 

A real-world example of this simple research compendium format is online here: https://github.com/duffymeg/BroodParasiteDescription
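To make the simple layout concrete, analysis/my_scripts.R could be an ordinary R script that reads the raw data from data/ and writes its outputs back into analysis/; the sketch below assumes placeholder column names x and y in my_data.csv:

# analysis/my_scripts.R
# Read the raw data, summarise it, and save a figure.
# Run from the project root so the relative paths resolve.

library(ggplot2)

my_data <- read.csv("data/my_data.csv")

# Quick look at the raw data
summary(my_data)

# Placeholder plot; x and y are illustrative column names
p <- ggplot(my_data, aes(x = x, y = y)) +
  geom_point()

ggsave("analysis/my_plot.png", p)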

  • An intermediate example might look like this:
project
|- DESCRIPTION          # project metadata and dependencies 
|- README.md            # top-level description of content and guide to users
|- NAMESPACE            # exports R functions in the package for repeated use
|- LICENSE              # specify the conditions of use and reuse of the code, data & text
|
|- data/                # raw data, not changed once created
|  +- my_data.csv       # data files in open formats such as TXT, CSV, TSV, etc.
|
|- analysis/            # any programmatic code 
|  +- my_report.Rmd     # R markdown file with R code and narrative text interwoven
|
|- R/                   # reusable R functions used across the project
|  +- my_functions.R    # custom R functions that are used more than once in the project
|
|- man/
|  +- my_functions.Rd   # documentation for the R functions (auto-generated when using devtools)

This intermediate example includes the R/ and man/ directories. These contain custom functions that are used repeatedly throughout the project. The man/ directory contains the manual, or documentation on the use of the functions. The NAMESPACE and LICENSE files are also typical features of R packages.
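Concretely, a function in R/my_functions.R could be documented with roxygen2-style comments, the convention that devtools supports; in this sketch the standardise() function is just a made-up example:

# R/my_functions.R

#' Standardise a numeric vector
#'
#' Centres and scales a numeric vector to mean 0 and standard deviation 1.
#'
#' @param x A numeric vector.
#' @return A numeric vector of the same length as x.
#' @export
standardise <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

Running devtools::document() turns these comments into the corresponding .Rd files under man/ and updates the NAMESPACE file, so the documentation and exports stay in sync with the code.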

For example, https://github.com/USEPA/LakeTrophicModelling has much of the repeatable code in R/ and the remainder of the code and text in vignettes/manuscript.Rmd.

  • As your project becomes more complex, it's OK to add logically-named subdirectories to keep files organised. There are very few strict rules here; the key principle is to keep your compendium logically organised so that another person can easily understand how your files relate to each other without having to ask you.

  • Naming objects is notoriously difficult to do well, so it's worth putting some effort into a logical and systematic file naming convention if you have a complex project with many files and directories (for example, a multi-experiment study where each experiment has numerous data and code files).

  • A more complex research compendium might look like this:

project
|- DESCRIPTION          # project metadata and dependencies 
|- README.md            # top-level description of content and guide to users
|- NAMESPACE            # exports R functions in the package for repeated use
|- LICENSE              # specify the conditions of use and reuse of the code, data & text
|- .travis.yml          # continuous integration service hook for auto-testing at each commit
|- dockerfile           # makes a custom isolated computational environment for the project
|
|- data/                # raw data, not changed once created
|  +- my_data.csv       # data files in open formats such as TXT, CSV, TSV, etc.
|
|- analysis/            # any programmatic code
|  +- my_report.Rmd     # R markdown file with narrative text interwoven with code chunks 
|  +- makefile          # builds a PDF/HTML/DOCX file from the Rmd, code, and data files
|  +- scripts/          # code files (R, shell, etc.) used for data cleaning, analysis and visualisation 
|
|- R/                     
|  +- my_functions.R    # custom R functions that are used more than once throughout the project
|
|- man/
|  +- my_functions.Rd   # documentation for the R functions (auto-generated when using devtools)
|
|- tests/
|  +- testthat.R        # unit tests of R functions to ensure they perform as expected
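
To illustrate the tests/ directory: by convention, tests/testthat.R just runs the test suite, with the individual test files living in tests/testthat/. A minimal sketch, reusing the placeholder package name and the standardise() function from the earlier sketches, might be:

# tests/testthat.R -- run all tests under tests/testthat/
library(testthat)
library(myproject)   # placeholder compendium/package name

test_check("myproject")

# tests/testthat/test-standardise.R
test_that("standardise gives mean 0 and sd 1", {
  z <- standardise(c(1, 2, 3, 4, 5))
  expect_equal(mean(z), 0)
  expect_equal(sd(z), 1)
})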

Real-world examples that are similar to this more complex research compendium format are online here:

Note that although these real-world examples have a common basic R package structure, they show quite a bit of variation in the location of things like the dockerfile, and the use of package features like the inst/ and vignettes/ directories. This kind of variation does not affect the function of the compendium as a package, and largely reflects personal choices about what kind of file organisation makes the most sense to each researcher.

Useful tools and templates for making research compendia

These templates are empty packages that show various ways of organising an analysis as an R package (e.g. where the manuscript is the package vignette, or similarly bundled with the package).

  • For writing papers in R markdown, useful packages include captioner and kfigr for figure and table captions and cross-referencing. There are many R markdown templates in the rticles package; these make it easy to get started with formatting and citations and to produce attractive PDF/HTML/DOCX output from R markdown documents.

  • If you start with a single R markdown document and want to develop it into a package, see the rlp package, which has functions for this purpose.

  • For capturing the computational environment of an analysis, rocker is a project that provides Docker containers to run R in a lightweight virtual environment. The hadleyverse container includes dplyr, ggplot2, etc., as well as RStudio server and LaTeX. The package harbor provides functions for controlling Docker containers on local and remote hosts. The analogsea package has functions for deploying R and RStudio quickly & easily on DigitalOcean clusters using Docker images for cloud computing. The dockertest package contains functions for generating Dockerfiles from R packages and other R projects, and building Docker containers that contain all the package dependencies.

  • For complex workflows where you only want to run components that have changed (e.g. because of long compute times), or need to run a series of R scripts in a specific order, you may find make useful. Make can be used to automatically execute sequences of analyses and update any set of files that depend on another set of files. This makes it a good solution for many data analysis and data management problems, including the generation of images from data. The remake package allows you to write makefile-like files entirely within R, saving you from having to learn make's language.
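If make itself is more than you need, the makefile.R approach mentioned earlier can be as simple as a script that sources the analysis steps in a fixed order; this sketch assumes scripts named following the 001-load.R convention above (003-model.R is a placeholder):

# analysis/makefile.R
# Run the analysis scripts in order. Unlike make, this re-runs every
# step each time rather than only the steps whose inputs have changed.

scripts <- c(
  "analysis/001-load.R",
  "analysis/002-clean.R",
  "analysis/003-model.R"   # placeholder for any further steps
)

for (script in scripts) {
  message("Running ", script)
  source(script)
}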

Challenges for future work

  • We could do with a package like devtools that automates the common problems and handles typical research workflow issues
  • We all need to try out the R package structure with our actual research projects and report back in one year to see what works, what doesn't, and where the pain points are.
  • It would be useful to share these processes in public so that people have good examples that they can look at.

Further reading

rOpenSci Guide to Reproducible Research

Gandrud, C. (2013). Reproducible Research with R and RStudio. CRC Press, Florida.

Gentleman, R. and Temple Lang, D. (2007). Statistical analyses and reproducible research. Journal of Computational and Graphical Statistics 16, 1–23.

Gentleman, R. and Temple Lang, D. (May 2004). Statistical Analyses and Reproducible Research. Bioconductor Project Working Papers.

Stodden, V. and Miguez, S. (2014). Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research. Journal of Open Research Software 2(1):e21, DOI: http://dx.doi.org/10.5334/jors.ay

Wickham, H. R Packages: Organise, Test, Document, and Share Your Code. O'Reilly.

Colophon

This document was the result of discussions at the 2015 rOpenSci unconference (cf. ropensci/unconf15#11 and ropensci/unconf15#31). Contributors to the discussion include... [if you were in the rOpenSci unconf breakout on this topic please add your name via a Pull Request]. This document was initially drafted by Hadley Wickham, with later contributions from Ben Marwick. Additional contributions are welcome! Please post an issue to ask questions and discuss suggestions.
