ReproZip

ReproZip is a tool aimed at simplifying the process of creating reproducible experiments from command-line executions, a frequently-used common denominator in computational science.

It tracks operating system calls and creates a package that contains all the binaries, files and dependencies required to run a given command on the author's computational environment (packing step). A reviewer can then extract the experiment in their own environment to reproduce the results (unpacking step).

Quickstart

We have an example repository with a variety of different software. Don't hesitate to check it out, and contribute your own example if you use ReproZip for something new!

Packing

Packing experiments is only available for Linux distributions. In the environment where the experiment is originally executed, first install reprozip:

$ pip install reprozip

Then, run your experiment with reprozip. Suppose you execute your experiment by originally running the following command:

$ ./myexperiment -my --options inputs/somefile.csv other_file_here.bin

To run it with reprozip, you just need to use the prefix reprozip trace:

$ reprozip trace ./myexperiment -my --options inputs/somefile.csv other_file_here.bin

This command creates a .reprozip-trace directory, in which you'll find the configuration file, named config.yml. You can edit the command line and environment variables, and choose which files to pack.

If you are using Debian or Ubuntu, most of these files (library dependencies) are organized by package. You can add or remove files, or choose not to include a package by changing its packfiles option from true to false. This lets reprozip create smaller packs (if space is an issue), and reprounzip can then download the omitted files from the package manager. Note, however, that this is only available for Debian and Ubuntu for now, and that package versions might differ. Choosing which files to pack is also important for removing sensitive information and third-party software that is not open source and should not be distributed.
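Flipping packfiles for one package can be scripted. The following is a minimal sketch, not part of ReproZip: the helper name set_packfiles is hypothetical, and the SAMPLE fragment is an assumed, simplified layout of the packages section (a generated config.yml contains more fields and comments). It edits the text line by line, so comments in the file are left untouched.

```python
import re

# Assumed, simplified fragment of the "packages" section of config.yml.
SAMPLE = '''packages:
  - name: "coreutils"
    packfiles: true
  - name: "python3"
    packfiles: true
'''


def set_packfiles(config_text, package_name, value):
    """Return config_text with `packfiles` set for the named package entry."""
    out = []
    in_target = False
    for line in config_text.splitlines():
        name_match = re.match(r'\s*-\s*name:\s*"?([^"\s]+)"?\s*$', line)
        if name_match:
            # A new package entry starts here; remember whether it is ours.
            in_target = name_match.group(1) == package_name
        elif in_target and re.match(r'\s*packfiles:\s*(true|false)\s*$', line):
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}packfiles: {'true' if value else 'false'}"
            in_target = False
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    print(set_packfiles(SAMPLE, "python3", False), end="")
```

For anything beyond a quick edit, opening config.yml in a text editor, as described above, remains the safer route.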

Once done editing the configuration file (or even if you did not change anything), run the following command to create a ReproZip package named my_experiment:

$ reprozip pack my_experiment.rpz
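The trace and pack steps above can also be driven from a script. This is a minimal sketch, assuming reprozip is installed and on the PATH of a Linux machine; the helper names packing_commands and run_all are hypothetical and not part of ReproZip. It uses only the two commands shown above.

```python
import subprocess


def packing_commands(experiment_argv, package_path):
    """Build the two command lines from the steps above: trace, then pack."""
    return [
        ["reprozip", "trace", *experiment_argv],
        ["reprozip", "pack", package_path],
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first one that fails."""
    for command in commands:
        runner(command, check=True)


if __name__ == "__main__":
    commands = packing_commands(
        ["./myexperiment", "-my", "--options",
         "inputs/somefile.csv", "other_file_here.bin"],
        "my_experiment.rpz",
    )
    # Print the commands; call run_all(commands) to actually execute them.
    for command in commands:
        print("$ " + " ".join(command))
```

Running the commands back to back skips the optional editing of config.yml, so this only fits the case where the default configuration is acceptable.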

Voilà! Now your experiment has been packed, and you can send it to your collaborators, reviewers, and researchers around the world!

Note that you can open the help message for any reprozip command by using the flag -h.

Unpacking

Do you need to unpack an experiment on a Linux machine? Easy! First, install reprounzip:

$ pip install reprounzip

Then, if you want to unpack everything in a single directory named mydirectory and execute the experiment from there, use the prefix reprounzip directory:

$ reprounzip directory setup my_experiment.rpz mydirectory
$ reprounzip directory run mydirectory

In case you prefer to build a chroot environment under mychroot, use the prefix reprounzip chroot:

$ reprounzip chroot setup my_experiment.rpz mychroot
$ reprounzip chroot run mychroot

Note that the previous options do not interfere with the original configuration of the environment, so don't worry! If you are using Debian or Ubuntu, reprounzip also has an option to install all the library dependencies directly on the machine using package managers (rather than just copying the files from the .rpz package). Be aware that this will interfere with your environment and may update your library packages, so use it at your own risk! For this option, just use the prefix reprounzip installpkgs:

$ reprounzip installpkgs my_experiment.rpz
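Since installpkgs is described above as a Debian and Ubuntu option, a script may want to check the distribution before calling it. The sketch below is an assumption-laden illustration, not something ReproZip provides: the helper name is_debian_or_ubuntu is hypothetical, and it relies on the ID and ID_LIKE fields of the standard /etc/os-release file. Treating derivatives (matched through ID_LIKE) as supported is also an assumption.

```python
def is_debian_or_ubuntu(os_release_text):
    """Return True if os-release content names Debian, Ubuntu, or a derivative."""
    fields = {}
    for line in os_release_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Values may be quoted with single or double quotes.
        fields[key] = value.strip().strip('"\'')
    ids = {fields.get("ID", "")} | set(fields.get("ID_LIKE", "").split())
    return bool(ids & {"debian", "ubuntu"})


if __name__ == "__main__":
    try:
        with open("/etc/os-release", encoding="utf-8") as handle:
            print(is_debian_or_ubuntu(handle.read()))
    except FileNotFoundError:
        # No os-release file: not a distribution this check recognizes.
        print(False)
```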

What if you want to reproduce the experiment in Windows or Mac OS X? You can build a virtual machine with the experiment! Easy as well! First, install the plugin reprounzip-vagrant:

$ pip install reprounzip-vagrant

Note that (i) you must install reprounzip first, and (ii) the plugin requires having Vagrant installed. Then, use the prefix reprounzip vagrant to create and start a virtual machine under directory mytemplate:

$ reprounzip vagrant setup my_experiment.rpz mytemplate

To execute the experiment, simply run:

$ reprounzip vagrant run mytemplate
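The two prerequisites noted above (reprounzip and Vagrant) can be checked before attempting the setup. This is a small sketch using only the Python standard library; the helper name missing_tools is hypothetical. It only confirms that the reprounzip and vagrant executables are on the PATH, and does not verify that the reprounzip-vagrant plugin itself is installed.

```python
import shutil


def missing_tools(tools, which=shutil.which):
    """Return the entries of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(["reprounzip", "vagrant"])
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("reprounzip and vagrant are both available")
```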

Alternatively, you may use Docker containers to reproduce the experiment, which also works under Linux, Mac OS X, and Windows! First, install the plugin reprounzip-docker:

$ pip install reprounzip-docker

Then, assuming that you want to create the container under directory mytemplate, simply use the prefix reprounzip docker:

$ reprounzip docker setup my_experiment.rpz mytemplate
$ reprounzip docker run mytemplate
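The four unpackers shown above (directory, chroot, vagrant, docker) follow the same setup-then-run shape, which makes them easy to generate from a script. The following is a minimal sketch under that observation; the names UNPACKERS and unpack_commands are hypothetical, and only the subcommands that appear in this README are used.

```python
# Unpackers that appear in the examples above.
UNPACKERS = ("directory", "chroot", "vagrant", "docker")


def unpack_commands(unpacker, package_path, target_dir):
    """Return the setup and run command lines for one unpacker."""
    if unpacker not in UNPACKERS:
        raise ValueError(f"unknown unpacker: {unpacker!r}")
    return [
        ["reprounzip", unpacker, "setup", package_path, target_dir],
        ["reprounzip", unpacker, "run", target_dir],
    ]


if __name__ == "__main__":
    for command in unpack_commands("docker", "my_experiment.rpz", "mytemplate"):
        print("$ " + " ".join(command))
```

Each returned list can be passed directly to subprocess.run, provided reprounzip and the relevant plugin are installed.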

Remember that you can open the help message and learn more about other available flags and options by using the flag -h for any reprounzip command.

Citing ReproZip

Please use the following when citing ReproZip (BibTeX):

ReproZip: Computational Reproducibility With Ease
F. Chirigati, R. Rampin, D. Shasha, and J. Freire.
In Proceedings of the 2016 ACM SIGMOD International Conference on Management of Data (SIGMOD), pp. 2085-2088, 2016

Contribute

Please subscribe to and contact the [email protected] mailing list for questions, suggestions and discussions about using reprozip.

Bugs and planned features are tracked in the GitHub issues. Feel free to add an issue!

To suggest changes to this source code, feel free to raise a GitHub pull request. Any contributions received are assumed to be covered by the BSD 3-Clause license. We might ask you to sign a Contributor License Agreement before accepting a larger contribution.

License

  • Copyright (C) 2014, New York University

Licensed under a BSD 3-Clause license. See the file LICENSE.txt for details.

Links and References

For more detailed information, please refer to our website, as well as to our documentation.

ReproZip is currently being developed at NYU. The team includes:
