Best Practices for Data Science Project Workflows and File Organization
To ensure replicability and traceability of scientific claims, it is essential that the scientific publication, the corresponding data sets, and the data analysis are made publicly available. All software should be openly shared and directly accessible to others. Publications should highlight that the code is freely available in, for example, Github. This openness can increase the chances of getting the paper accepted for publication. Journal editors and reviewers receive the opportunity to reproduce findings during the manuscript review process, increasing confidence in the reported results. Once the paper is published, your work can be reproduced by other member of the scientific community, which can increase citations and foster opportunities for further discussion and collaboration.
Basic workflow of data analysis
Not every data analysis is the same and but this is a useful template for what the pieces of a data analysis are and how they flow together.
- Define question
- Determine what data you can access
- Obtaining the data
- Cleaning the data
- Exploratory data analysis
- Statistical prediction/modeling
- Interpretation of results
- Challenging of results
- Synthesis and write up
- Creating reproducible code
Creating a project skeleton
Linux
Create a project directory and nested sub directories with a one-liner
in the Terminal
:
mkdir -p hsapiens-snps/{data/{raw,tidy},refs,src,bin,analysis,figures} # no spaces between subdirectories
- All of your raw data goes into
raw
# Compress (gzip file.fasta
) and read-only (chmod 400 file.fasta
). - All of your post processed data goes into
tidy
- Reference Data (e.g. Genomes/Transcriptomes) go into
refs
- Source code for downloaded tools go into
src
- Scripts you’ve written and compiled binaries go into
bin
- All of the intermediary files generated, should go into
analysis
folder.- You can then logically separate sub projects
(e.g.
sequencing-data-qc
,alignment-results-qc
,diff-exp-analysis
, etc.)
- You can then logically separate sub projects
(e.g.
R
There are a few differences between standard skeletons on terminal and
R
.
- A
bin
folder is not required forR
projects - Scripts should be kept in the
R
directory. Ideally you should have one script to run everything if possible in the main directory or aMakefile
- A
man
directory that contains the manual (i.e. documentation) for the use of the functions in your package (optional directory if usingroxygen2
, you should!) - Github is not designed for hosting large data sets, either host them externally or provide a small-sample data set that people can try out the techniques without having to run very expensive computations
Create New Project
with RStudio
(e.g. hsapiens-snps) and then
create the sub folders
library(purrr)
library(here)
folder_names <- c("data/raw", "data/tidy", "refs", "R", "analysis", "figures", "man")
#sapply(folder_names, dir.create) # base R way
map(folder_names, dir.create) #purrr-fect way
In R
use the file.create()
function. To verify that the file has
been created use list.files()
. Use file.rename()
function to change
names or the file.remove()
to remove files, and file.exists()
to
check if a file exists. You can use the dir.create()
function to
create directories in R
as well.
If you project is only based in R
you should create one folder per
project with the files organized like this:
name_of_project
|--data
|--raw
|--file_001.xlsx
|--file_002.gen
|--file_002.sample
|--tidy
|--file_001-cleaned.csv
|--refs
|--Oldach_2018.pdf
|--analysis
|--01-analysis.Rmd
|--figures
|--01-scatterplot.jpg
|--01-correlation.png
|--R
|--exploratory_analysis.R
|--pdf_scraper.R
|--name_of_project.Rproj
|--run_all.R
File naming
- Avoid spaces:
2018 eda report.md
is not a good name but2018_eda_report.md
is. - Avoid punctuation, periods and other special characters except for underscores and dashes.
- Always use lower case:
Esophogael-Cancer_Report.md
is not a good name butesophogael-cancer_report.md
is. - Use leading zeros (left-pad) for correct ordering of files with
ls
: Usefile-021.txt
NOTfile21.txt
.- The numbers are determined by how many files you will potentially have. So if you may not have more than 1000 files you can choose three digits. If not more than a hundred files choose two digits and so on.
Create dated directories/files using the command date +%F
for ISO 8601
(yyyy-mm-dd
). If you keep a digital notebook you can Ctrl+F
to look
for something see what date you did it and easily find that sub
directory.
mkdir results-$(date +%F)
Git with Github
For every project you should create a repository on Github (Gitlab, Bitbucket, SourceForge, etc.) to take advantage of branches & pull requests, code snapshots, tracking project bugs and enhancements using issues, and dissemination of the final results. The recommendations below should be broadly applicable to most repository hosting services. These recommendations take full advantage of Github’s features for managing and promoting projects in bioinformatics as well as in many other research domains.
Github repositiores are freely visible to all. Many projects decide to share their work publicly and openly from the start of the project in order to attract visibility and to benefit from contributions from the community early on. Private “repos” on Github/Gitlab are used to ensure that work is hidden but also limit collaborations to just those users who are given access to the repository. These repos can then be made public at a later stage, for example upon submission, acceptance, or publicaiton of corresponding journal atricles.
A useful gist for setting up version control and Github can be found here and the above was taken from the great introduction to Git/Github by Perez-Riverol et al., 2016.
Every repository should have ideally the following three files (in addition to the code):
- The
LICENSE
file that clearly defines the permissions and restrictions attached to the code and other files in your repository. - The
README
file which provides a short description of the project, a quick start guide, information on how to contribute, a TODO list, and links to additional documentation. - The
CITATION
file to the repository informs your users how to cite and credit your project.
LICENSE
file
The Common licenses such as those listed on http://choosealicense.com are
preferred. The LICENSE
file in the repository should be a plain-text
file containing the contents of an Open Source Initiative
(OSI)-approved license,
not just a reference to the license.
README
file
The A README.md
file that describes the overall project and where to get
started. It may be helpful to include a graphical summary of the
interlocking pieces of the project.
In each directory explaining how code is organized (e.g.
data/README.txt
contains metadata about all data files in data
directory). For each file name, I recommend to have a short description
of what the file is for. Document your methods/workflow
, definition of
code and symbols used to deal with missing data, include URL
’s for
references to publication or external data and document when you
download data (database release number/version number/names, etc.) and
if you used MySQL
or UCSC Genome browser
to download data.
Update the README
file as you work. As you create new files and
folders, add their descriptions to the README
. Include a list of
variables, units of measurement of each variable, and information about
how others should cite your analysis.
Browse some examples of awesome README
s
here!
DESCRIPTION
file
The A DESCRIPTION
file in the project root provides formally structured,
machine- and human-readable information about the authors, the project
license, the software dependencies, and other metadata of the
compendium. When yore compendium is an R package, you can take advantage
of many time-saving tools for package development, testing, and sharing
(e.g. the devtools
package).
NAMESPACE
file
The A NAMESPACE
file in the project root provides that provides the
exports the functions in a package, so that they can be used in the
current project and by other packages.
Creating a research commpendium
Temple, Lang and Gentleman (2007) advanced the proposal for using the R package structure as a “Research Compendium,” an idea that has since caught on with many others (Marwick et al., 2018).
- Papers with publicly available data sets may receive a higher number of citations than similar studies without available data
- Data sharing is associated with higher publication productivity
- A compendium makes it easier to communicate our work with others, including collaborators and (not least of all) our future selves
It may be tempting to organize all scripts and reports with ascending
names, for example 001-load.R
, 002-clean.R
etc. and although it
helps with organization it does not capture the full tree of
dependencies. In the terminal it would be best to use a Makefile
to
manage more complex workflows as they capture the full tree of
dependencies and control the order of code execution.
On R
I would strongly recommend using the Drake
package which I’ve talked about
recently: “Getting Started with Drake in
R”.
If your workflow combines R
code, terminal
and python
I would
suggest using snakemake
or
Airflow
.
(TO DO: look into the reticulate
package and see how complex
multi-language workflows are managed)
Examples of real-world research compendia using R packages
The Boettiger and colleagues example provides a more complex example of
a large compendia. This example includes a .travis.yml
, a
configuration file for the Travis continuous integration service, a
Dockerfile
that instructs the Docker software how to create a virtual
environment for running the R code in an isolated and portable context,
and the Makefile
which has instructions for executing R code in the
compendium in a way that avoids unnecessary repetition. These three
files contain organizing metadata that are intended to be both machine-
and human-readable, but are not written in the R language.
Unit tests, contained in the tests
directory, are R code scripts to
test that specific functions in the compendium produce their expected
outputs, given known inputs. The addition of these tools, initially
developed for software engineering, solves the problems of specifying
the computational environment and the relationships between data, code,
and output.
The top-level manuscripts
directory of Boettiger et al., holds the
files that generate the journal article and the supplementary documents,
as well as the Makefile
and Dockerfile
. Other notable items at the
top-level of this compendium are the .drone.yml
, .zenodo.json
and
.traivs.yml
files. The drone file contains configuration details for
the Drone continuous integration service that operates in Docker
containers. In this case, each time a commit is made to the repository,
the Drone web service automatically renders the manuscript files
(including running the R code in the manuscript) to PDF in a Docker
container defined by the Dockerfile
in manuscripts/
. This provides
an automatic check to see if a PDF can be generated from the manuscript
R markdown files after the last commit. The .travis.yml
file performs
a similar purpose, but is focused on whether or not the R package can be
successfully build in a generic Linux environment. The .zenodo.json
file provides machine-readable metadata about the compendium. This is
useful for automatically archiving the compendium at zenodo.
Automatically create project skeletons in R and integrate
The starters
package on
Github provides a number of functions, leveraging the usethis
package,
designed to take away some of the grunt work involved in setting up
new projects. The createAnalysisProject()
function creates a project
ready for a typical analysis project with packrat
to manage
dependencies to keep things consistent and reproducible, while the
createPackageProject()
helps setup a project with code coverage,
vignettes, unit testing etc. out of the box.
Dedicated code testing is aimed at detecting possible bugs introduced by
new features or changes in the code or dependencies, as well as
detecting wrong results, known as logic errors, in which the source
code produces a different result than what was intended. Continuous
integration provides a way to automatically and systematically run a
series of tests to check integrity and performance of code, a task that
can be automated through Github. starters
sets up [Travis
CI](https://Travis-ci.org/,
a hosted continued integration platform that is free for all open-source
projects. Travis CI builds and test the source code using a plethora of
options such as different platforms and interpreter versions.
By organizing files into an R package, you follow conventions that save
you time thinking about the best way to organize your project. Writing
functions for packages makes it easy to document the use of code
(particularly if you use roxygen2
). This documentation can help you be
productive more quickly when returning to work on a project after
stepping away from it.
As a project grows in size and collaboration, having the more rigid structure offered by a package format becomes increasingly valuable. Packages can automate installation of dependencies, perform checks that changes have not broken code, and provide a modular and highly tested framework for collecting together the three essential elements: data, code, and manuscript-quality documentation, in a powerful and feature-rich environment.
When you iteratively develop a package, you can easily run comprehensive checks on your code with
R CMD check
at the terminal ordevtools::check()
at the R console. Regular checking your package like this can help identify problems before they become very frustrating to solve.
Packages are meant to be distributed, so viewing a project as a package means preparing your project to be distributed. This can help keep you honest when programming; you’re less likely to take shortcuts when you plan on releasing it to the world. Furthermore, others can start using the tools you made and that can earn you citations or even an additional paper in J. Stat. Soft. or the R Journal.
RMarkdown
and bookdown
)
Template for scholarly writing (Another tool I have not yet tried yet, but looks promising, is
rrtools
. It provides a
convenient starting point for writing a journal article, report, or
thesis.
Publish your analysis to a website
The traditional way to promote scientific software (and manuscripts) is by publishing a paper in the peer-reviewed scientific literature. Additional steps can boost the visibility of an organization. For example Github Pages are simple websites freely hosted by Github. User can create and host blog websites, help pages, manuals, tutorials, and websites related to specific projects. Pages come with a powerful static site generator called Jekyll that can be integrate with other frameworks as Boostrap or platforms such as Disqus to support and moderate comments.
There are also real-time communication systems such as Gitter that allows the user community, developers, and project collaborators to exchange ideas and issues to report bugs or get support. Within a Gitter chat, members can reference issues, comments, and pull requests. Github also supports wikis for each repository.
The workflowr
package
combines literate programming (knitr
and rmarkdown
) and version
control (Git
via git2r
) to generate a website containing
time-stamped, versioned, and documented results.
Blog Post
A much shorter version of this article appeared originally on Medium
Contact information
Matthew J. Oldach
- E-mail: [email protected]
- Twitter: MattOldach
- Website: https://moldach.github.io/
Acknowledgments
- A Quick Guide to Organizing Computational Biology Projects
- Best Practices for Scientific Computing
- Good Enough Practices in Scientific Computing
- The Biostar Handbook
- Organizing Data Science Projects
- Designing projects, by Rich Fitzjohn
- How I Start A Bioinformatics Project, by Nick Lowman
- Report Writing for Data Science in R, by Roger D. Peng
- here, here, by Jenny Bryan
- here documentation, by Kirill Muller
- File organization best practices by Andrew Tran
- ropensci/rrrpkg: tools and templates for making research compendia
- How to share data with a statistician by Jeff Leek
- Ten Simple Rules for Taking Advantage of Github