  • Language
    Rust
  • License
    MIT License
  • Created over 1 year ago
  • Updated 8 months ago

Command line scientific data management tool

SciDataFlow — Facilitating the Flow of Data in Science

Problem 1: Have you ever wanted to reuse and build upon a research project's output or supplementary data, but can't find it?

SciDataFlow solves this issue by making it easy to unite a research project's data with its code. Often, code for open computational projects is managed with Git and stored on a site like GitHub. However, a lot of scientific data is too large to be stored on these sites, and instead is hosted by sites like Zenodo or FigShare.

Problem 2: Does your computational project have dozens or even hundreds of intermediate data files you'd like to keep track of? Do you want to see if these files are changed by updates to computational pipelines?

SciDataFlow also solves this issue by keeping a record of the necessary information to track when data is changed. This is stored alongside the information needed to retrieve data from and push data to remote data repositories. All of this is kept in a simple YAML "Data Manifest" (data_manifest.yml) file that SciDataFlow manages. This file is stored in the main project directory and meant to be checked into Git, so that Git commit history can be used to see changes to data. The Data Manifest is a simple, minimal, human and machine readable specification. But you don't need to know the specifics — the simple sdf command line tool handles it all for you.
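The change tracking described above can be pictured with a small Rust sketch. It is only an illustration under stated assumptions, not SciDataFlow's implementation: the names TrackedFile, FileStatus, track, and status are hypothetical, the struct is not the actual Data Manifest schema, and the FNV-1a hash is a simple stand-in for whatever digest a real tool would record.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical manifest entry: a tracked path plus the digest recorded
/// when the file was registered. Not SciDataFlow's actual schema.
#[derive(Debug, Clone, PartialEq)]
struct TrackedFile {
    path: PathBuf,
    recorded_digest: u64,
}

/// Possible states of a tracked file relative to its recorded digest.
#[derive(Debug, PartialEq)]
enum FileStatus {
    Unchanged,
    Changed,
    Missing,
}

/// 64-bit FNV-1a hash. Assumption: used here only because it needs no
/// external crate; a real tool would likely use a stronger digest.
fn fnv1a64(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Register a file by recording the digest of its current contents.
fn track(path: &Path) -> io::Result<TrackedFile> {
    let bytes = fs::read(path)?;
    Ok(TrackedFile {
        path: path.to_path_buf(),
        recorded_digest: fnv1a64(&bytes),
    })
}

/// Compare the file on disk against the recorded digest.
fn status(entry: &TrackedFile) -> io::Result<FileStatus> {
    match fs::read(&entry.path) {
        Ok(bytes) if fnv1a64(&bytes) == entry.recorded_digest => Ok(FileStatus::Unchanged),
        Ok(_) => Ok(FileStatus::Changed),
        Err(e) if e.kind() == io::ErrorKind::NotFound => Ok(FileStatus::Missing),
        Err(e) => Err(e),
    }
}

fn main() -> io::Result<()> {
    // Write a small example file in the system temp directory.
    let path = std::env::temp_dir().join("sdf_sketch_example.tsv");
    fs::write(&path, "chrom\tpos\n")?;
    let entry = track(&path)?;
    println!("{:?}", status(&entry)?);

    // A pipeline rerun that alters the file is detected as a change.
    fs::write(&path, "chrom\tpos\nchr1\t100\n")?;
    println!("{:?}", status(&entry)?);

    // A deleted file is reported as missing rather than as an error.
    fs::remove_file(&path)?;
    println!("{:?}", status(&entry)?);
    Ok(())
}
```

Because the recorded digest lives in a text file that is checked into Git, a changed digest shows up as an ordinary diff in the commit history.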

The SciDataFlow manuscript has been published in Bioinformatics. If you use SciDataFlow, please consider citing it:

V. Buffalo, SciDataFlow: A Tool for Improving the Flow of Data through Science. 
Bioinformatics (2024), doi:10.1093/bioinformatics/btad754.

The BibTeX entry can be accessed by clicking "Cite this repository" on the right side of the main GitHub repository page.

Documentation

SciDataFlow has extensive documentation full of examples of how to use the various subcommands.

SciDataFlow's Vision

The larger vision of SciDataFlow is to change how data flows through scientific projects. The way scientific data is currently shared is fundamentally broken, which prevents the reuse of data that is the output of some smaller step in the scientific process. We call these scientific assets.

Scientific Assets are the output of some computational pipeline or analysis which has the following important characteristic: Scientific Assets should be reusable by everyone, and be reused by everyone. Being reusable means all other researchers should be able to quickly reuse a scientific asset (without having to spend hours trying to find and download data). Being reused by everyone means that using a scientific asset should be the best way to do something.

For example, if I lift over a recombination map to a new reference genome, that pipeline and output data should be a scientific asset. It should be reusable by everyone; we should not each be rewriting the same bioinformatics pipelines for routine tasks. Rewriting them causes three problems: (1) each reimplementation has an independent chance of errors, (2) it's a waste of time, and (3) there is no cumulative improvement of the output data. It's not an asset; the result of each implementation is a liability!

Lowering the barrier to reusing computational steps is one of SciDataFlow's main motivations. Each scientific asset should have a record of what computational steps produced output data, and with one command (sdf pull) it should be possible to retrieve all data outputs from that repository. If the user only wants to reuse the data, they can stop there — they have the data locally and can proceed with their research. If the user wants to investigate how the input data was generated, the code is right there too. If they want to try rerunning the computational steps that produced that analysis, they can do that too. Note that SciDataFlow is agnostic to this — by design, it does not tackle the hard problem of managing software versions, computational environments, etc. It can work alongside software (e.g. Docker or Singularity) that tries to solve that problem.

By lowering the barrier to sharing and retrieving scientific data, SciDataFlow hopes to improve the reuse of data.

Future Plans

In the long run, the SciDataFlow YAML specification would allow for recipe-like reuse of data. I would like to see, for example, a set of human genomics scientific assets on GitHub that are continuously updated and reused. Then, rather than a researcher beginning a project by navigating many websites for human genome annotation or data, they might do something like:

$ mkdir -p new_adna_analysis/data/annotation
$ cd new_adna_analysis/data/annotation
$ git clone git@github.com:human_genome_assets/decode_recmap_hg38
$ (cd decode_recmap_hg38/ && sdf pull)
$ git clone git@github.com:human_genome_assets/annotation_hg38
$ (cd annotation_hg38 && sdf pull)

and so forth. Then, they may look at the annotation_hg38/ asset, find a problem, fix it, and issue a GitHub pull request. If the fix is accepted, the maintainer would then just do sdf push --overwrite to push the data file to the data repository. The Scientific Asset is then updated for everyone to use and benefit from. All other researchers can then instantly use the updated asset; all it takes is a mere sdf pull --overwrite.

Installing SciDataFlow

If you'd like to install the Rust Programming Language manually, see this page, which instructs you to run:

$ curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

Then, to install SciDataFlow, just run:

$ cargo install scidataflow

To test, just try running sdf --help.

Reporting Bugs

If you are a user of SciDataFlow and encounter an issue, please submit an issue to https://github.com/vsbuffalo/scidataflow/issues!

Contributing to SciDataFlow

If you are a Rust developer, please contribute! Here are some great ways to get started (also check the TODO list below, or look for TODOs in the code!):

  • Write some API tests. See some of the tests in src/lib/api/zenodo.api as an example.

  • Write some integration tests. See tests/test_project.rs for examples.

  • A cleaner error framework. Currently SciDataFlow uses anyhow, which works well, but it would be nice to have more specific error enums.

  • Improve the documentation!
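As a starting point for the error-framework item in the list above, here is a minimal sketch of what a more specific error enum could look like using only the standard library. Everything in it is an assumption for illustration: SdfError, its variants, and the helper functions read_manifest and require_tracked are hypothetical names, not types or functions from the SciDataFlow codebase.

```rust
use std::error::Error;
use std::fmt;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical error type; the variants are illustrative only.
#[derive(Debug)]
enum SdfError {
    ManifestNotFound(PathBuf),
    FileNotTracked(PathBuf),
    Io(io::Error),
}

impl fmt::Display for SdfError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            SdfError::ManifestNotFound(p) => {
                write!(f, "no data manifest found at {}", p.display())
            }
            SdfError::FileNotTracked(p) => {
                write!(f, "file is not in the data manifest: {}", p.display())
            }
            SdfError::Io(e) => write!(f, "I/O error: {}", e),
        }
    }
}

impl Error for SdfError {
    // Expose the underlying I/O error so callers can walk the error chain.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            SdfError::Io(e) => Some(e),
            _ => None,
        }
    }
}

// Lets `?` and `.into()` convert io::Error into SdfError.
impl From<io::Error> for SdfError {
    fn from(e: io::Error) -> Self {
        SdfError::Io(e)
    }
}

/// Hypothetical helper: read a manifest, mapping "not found" to a
/// dedicated variant and every other I/O failure to SdfError::Io.
fn read_manifest(path: &Path) -> Result<String, SdfError> {
    match fs::read_to_string(path) {
        Ok(text) => Ok(text),
        Err(e) if e.kind() == io::ErrorKind::NotFound => {
            Err(SdfError::ManifestNotFound(path.to_path_buf()))
        }
        Err(e) => Err(e.into()),
    }
}

/// Hypothetical helper: fail with a specific variant if `file` is untracked.
fn require_tracked(tracked: &[PathBuf], file: &Path) -> Result<(), SdfError> {
    if tracked.iter().any(|p| p.as_path() == file) {
        Ok(())
    } else {
        Err(SdfError::FileNotTracked(file.to_path_buf()))
    }
}

fn main() {
    let tracked = vec![PathBuf::from("data/recmap.tsv")];

    // Callers can branch on the exact failure instead of matching on strings.
    match require_tracked(&tracked, Path::new("data/other.tsv")) {
        Err(SdfError::FileNotTracked(p)) => println!("not tracked: {}", p.display()),
        other => println!("{:?}", other),
    }

    if let Err(e) = read_manifest(Path::new("no_such_dir/data_manifest.yml")) {
        println!("{}", e);
    }
}
```

The benefit over a single opaque error is that tests and callers can match on a variant, while the Display and source implementations keep user-facing messages and error chains intact.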

Todo

  • [] sdf mv tests, within different directories.
