
Data Versioning

(c) Thomas J. Leeper (2015), licensed CC-BY

Version control is a huge part of reproducible research and open source software development. Versioning provides a complete history of some digital object (e.g., a software program, a research project, etc.) and, importantly, allows one to trace what changes have been made to that object, when those changes were made, and (with the appropriate metadata) why those changes were made. This document holds some of my current thinking about version control for data.

Background Context

Especially in the social sciences, researchers depend on large, public datasets (e.g., Polity, Quality of Government, Correlates of War, ANES, ESS, etc.) as source material for quantitative research. These datasets typically evolve (new data is added over time, corrections are made to data values, etc.) and new releases are periodically made public. Sometimes these data are complex collaborative efforts (see, for example, Quality of Government), while others are public releases of single-institution data collection efforts (e.g., ANES). While collaborative datasets create a more obvious use case for version control, single-institution datasets might also be improved by version control. This is particularly important because old releases of these vital datasets are often not archived (e.g., ANES), meaning that it is essentially impossible to recover a prior version of a given ANES dataset after a new release has occurred. This post is meant to steer thinking about how to manage the creation, curation, revision, and dissemination of these kinds of datasets. While the ideas here might also apply to how researchers manage their own data, they probably apply more at the stage of data creation than at later data use, after a dataset is essentially complete or frozen.

Review of Existing Tools and Approaches

The Open Data Institute has a nice post outlining the challenges of using standard, software-oriented version control software (namely, git) for version control of data. The main issue is that git, like almost all VCS, is designed to monitor changes to lines in text files. This makes a lot of sense for code, as well as for articles, text, etc. But it starts to make less sense for digital objects where a line is not a meaningful unit. This becomes really clear when we start to version something like a comma-separated values (CSV) file (as the ODI post describes). Changing a single data field leads to a full-line change, even though only one cell actually changed. A similar problem emerges in XML, JSON, or other text-delimited formats (though, note that the Open Knowledge Foundation seems to favor JSON as a storage mode).
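
To make the problem concrete, here is a minimal sketch (my illustration, not taken from the ODI post) of how a line-oriented tool perceives a single-cell edit to a CSV:

```r
# A single-cell edit to a CSV looks like a whole-record change to a
# line-based tool such as git: the old line is "deleted" and a new
# line is "added", even though only one field differs.
old <- c("id,name,score", "1,Smith,40", "2,Jones,42")
new <- c("id,name,score", "1,Smith,40", "2,Jones,43")

setdiff(old, new)  # "2,Jones,42" -- the entire row counts as removed
setdiff(new, old)  # "2,Jones,43" -- and re-added with one cell changed
```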

Using the linebreak (\n) as the delimiter for tracking changes to an object simply does not work well for data. daff is a JavaScript tool that tries to make up for that weakness. It provides a diff (a comparison of two file versions) that respects the tabular structure of many data files by highlighting cell-specific changes. It is not a full version control system, though; it's simply a diff tool that can serve as a useful add-on to git.
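
For R users, there is also a daff package on CRAN that wraps the same JavaScript library. A minimal sketch (function names as I understand them from the package):

```r
# Cell-aware diff of two versions of a data frame using the 'daff'
# R package, a wrapper around the JavaScript library.
library(daff)

v1 <- data.frame(id = 1:3, value = c(10, 20, 30))
v2 <- v1
v2$value[2] <- 21        # change a single cell

d <- diff_data(v1, v2)   # tabular diff, flagging only the changed cell
render_diff(d)           # view the highlighted diff as HTML
```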

A post on the Open Knowledge Foundation blog argues that there is some logic to line-based changesets because a standard CSV (or any tabular data file) typically records a single observation on its own line. Line-based version control then makes sense for recording changes to observations (thus privileging rows/observations over columns/variables).

dat aims to be a "git for data", but the status of the project is a little unclear given that there hasn't been very active development on GitHub. A Q/A on the Open Data StackExchange points to some further resources for data-appropriate git alternatives, but there's nothing comprehensive.

Universal Numeric Fingerprint (UNF) provides a potential strategy for version control of data. Indeed, that's specifically what it was designed to do. It's not a version control system per se, but it provides a hash that can be useful for determining when a dataset has changed. It has some nice features:

  • File format independent (so better than an MD5)
  • Column order independent (a dataset where columns have been shuffled produces the same UNF hash)
  • Consistently handles floating point numbers (rounding and precision issues are unproblematic)

But UNF is not perfect. The problems include:
  • It is sort-sensitive (an identical dataset that is row sorted produces a different UNF; thus privileging columns/variables over rows/observations)
  • There is no handling of metadata (e.g., variable names are not part of the hash)
  • It is quite sensitive to data structure (e.g., "wide" and "long" representations of the same dataset produce different UNFs)
  • It is not a version control system and provides essentially no insights into what changed, only that a change occurred
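
These properties are easy to see in practice. Here is a minimal sketch using the UNF package for R (exact hashes depend on the algorithm version):

```r
# Illustrating UNF's column-order independence and row-sort
# sensitivity with the 'UNF' R package.
library(UNF)

d1 <- data.frame(x = 1:3, y = c(2.1, 3.2, 4.3))
d2 <- d1[, c("y", "x")]  # identical data, columns reordered
d3 <- d1[c(3, 1, 2), ]   # identical data, rows reordered

unf(d1)  # e.g., UNF6:...
unf(d2)  # same signature as d1: column order is ignored
unf(d3)  # different signature: row order changes the hash
```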

All of these tools also focus on the data themselves, rather than associated metadata (e.g., the codebook describing the data). While some data formats (e.g., proprietary formats like Stata's .dta and SPSS's .sav) encode this metadata directly in the file, it is not a common feature of widely used text-delimited formats. Sometimes codebooks are modified independently of data values and vice versa, but it is rare to see large public datasets provide detailed information about changes to either the data or the codebook, except in occasional releases.
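
One partial remedy on the text-delimited side is to bundle a machine-readable codebook with the data, for example as a YAML front-matter block in the csvy format. A sketch using the csvy package (function names assumed from the package):

```r
# The 'csvy' format embeds a YAML metadata header (variable names,
# types, labels) above an otherwise ordinary CSV body, so data and
# codebook travel together in one plain-text file.
library(csvy)

write_csvy(iris, "iris.csvy")  # data plus auto-generated metadata header
d <- read_csvy("iris.csvy")    # metadata restored as attributes
```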

Another major challenge to data versioning is that existing version control tools are not well designed to handle provenance. When data is generated, stored, or modified, a software-oriented version control system has no obvious mechanism for recording why values in a dataset are what they are or why changes are made to particular values. A commit message might provide this information, but as soon as a value is changed again, the history of changes to a particular value is lost in the broader history of the data file as a whole.

So, there are clearly no complete tools in existence for versioning data.

Some Principles of Data Versioning

The first principle of data versioning is that changes to data have sources or explanations. A system of data versioning must be able to connect data values, structure, and metadata (and changes to those features) to explanations of those values or the changes to values at the value level (rather than at the level of variables, observations, or files).

The second principle is that data versioning should be value-based, not variable- or observation-based. A system cannot privilege observations over variables or variables over observations; a change to an observation is necessarily a change to a variable, and vice versa.

The third principle of data versioning is that data exist independent of their format. If one changes a data value in a CSV versus a JSON tree, those are content-equivalent changes. As such, any system of version control should allow data users to interact with data in whatever file format they choose without necessarily using the underlying data storage format.

The fourth principle of data versioning is that collaboration is essential to data generation and curation. A system of data versioning must be natively collaborative and logically record who is generating and modifying data.

The fifth principle of data versioning is that changes to data structure should be recorded independently of data values. Sort order of observations, the arrangement of columns/variables, and the arrangement of rows as cases versus case-years, etc. (i.e., "wide" versus "long" arrangements) are structural features of a dataset, not features of the data per se. These are important, but a data versioning process should be able to distinguish a change to the content of a dataset from a change to the organization or structure of a dataset and, in the latter case, correctly recognize as identical a dataset that is arranged in two different ways.
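
As a sketch of the wide-versus-long point, here are two arrangements of the same content in base R:

```r
# The same content arranged two ways; a structure-aware versioning
# system should recognize these as content-identical, even though
# (for example) their UNF signatures differ.
wide <- data.frame(country = c("A", "B"),
                   y1990 = c(1, 3), y1991 = c(2, 4))

long <- reshape(wide, direction = "long",
                varying = c("y1990", "y1991"), v.names = "value",
                timevar = "year", times = c(1990, 1991),
                idvar = "country")
```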

The sixth principle of data versioning is that metadata matters but, like structure, should be handled separately from changes to data. Two identical datasets with different metadata should be recognized as content-identical.

Tentative Conclusions

Data should be stored in a key-value manner, where an arbitrary key holds a particular data value. A mapping then connects those particular data values to both observations and variables, so that any assessment of changes to data are format-independent and structure-independent. As such, a change to a value is recorded first as a change to a value but can be secondarily recognized as a simultaneous change to both an observation and a variable.
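
A minimal sketch of what such a key-value representation might look like (my illustration of the idea, not an existing system):

```r
# Each cell is stored against an arbitrary key, with a mapping that
# ties the key to both an observation and a variable.
cells <- data.frame(
  key   = c("k1", "k2", "k3", "k4"),
  obs   = c(1, 1, 2, 2),
  var   = c("x", "y", "x", "y"),
  value = c(10, 2.1, 20, 3.2),
  stringsAsFactors = FALSE
)

# A change is recorded against the key first...
cells$value[cells$key == "k2"] <- 2.15
# ...and can be read off secondarily as a change to observation 1
# and to variable "y".
subset(cells, key == "k2")
```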

Any interface to such a key-value store should come in a familiar and flexible form: users should interact with a data versioning system via whatever manner they currently use data (e.g., a text editor, data analysis software, a spreadsheet application, etc.). Changes should be recorded in a master file that can natively and immediately import from and export to any data file format (including delimited files, spreadsheets, XML, JSON, etc.).
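
One existing illustration of format-agnostic data access is the rio package; a versioned store could expose its canonical representation through exactly this kind of interface (the versioning layer itself is hypothetical):

```r
# Round-tripping the same content through different serializations
# with 'rio': users never need to care which format is canonical.
library(rio)

export(mtcars, "mtcars.csv")
export(mtcars, "mtcars.json")

d1 <- import("mtcars.csv")
d2 <- import("mtcars.json")  # same content, different file format
```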

Data versioning systems must have a more sophisticated commit system than that provided by current, software-oriented version control systems. In particular, commits should record not only the change to a data value, structure, or metadata but also structured information that explains that change, including the reason for the change, the source(s) of the data value, and the author of the change. In essence, both changesets and states of the complete data should be fully citable and carry citation-relevant metadata.
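
A structured changeset along these lines might carry fields like the following (a hypothetical schema, for illustration only):

```r
# Sketch of a citable, structured commit record for one value change.
change <- list(
  key       = "k2",                 # which cell changed
  old_value = 2.1,
  new_value = 2.15,
  reason    = "transcription error corrected against primary source",
  source    = "hypothetical source citation for the new value",
  author    = "J. Doe",
  timestamp = Sys.time()
)
```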
