  • Stars: 195
  • Rank: 199,374 (Top 4 %)
  • Language: Go
  • License: Other
  • Created almost 12 years ago
  • Updated 9 months ago

Repository Details

Distributed, Versioned, Image-oriented Dataservice

DVID is a Distributed, Versioned, Image-oriented Dataservice written to support neural reconstruction, analysis and visualization efforts at HHMI Janelia Research Center using teravoxel-scale image volumes.

Its goal is to provide:

  • A framework for thinking of distribution and versioning of large-scale scientific data similar to distributed version control systems like git.
  • Easily extensible data types (e.g., annotation, keyvalue, and labelmap in the figure below) that allow tailoring of APIs, access speeds, and storage space for different kinds of data.
  • The ability to use a variety of storage systems via plugin storage engines, currently limited to systems that can be viewed as (preferably ordered) key-value stores.
  • A stable science-driven HTTP API that can be implemented either by native DVID data types or by proxying to other services.

High-level architecture of DVID

How it's different from other forms of versioned data systems:

  • DVID handles large-scale data, on the order of billions or more discrete units. At this scale, storing so many files can be difficult on a local file system and can impose heavy load even on shared file systems. Cloud storage is always an option, but it adds latency and doesn't reduce the transfer time for such large numbers of files or data chunks. Database systems (including embedded ones) handle this by consolidating many small pieces of data into larger files, which can also be described as a sharded data approach.
  • All versions are available for queries. There is no checkout to read committed data.
  • The high-level science API uses pluggable datatypes. This allows clients to operate on domain-specific data and operations rather than operations on generic files.
  • Data can be flexibly assigned to different types of storage, so tera- to peta-scale immutable imaging data can be kept in cloud storage while smaller, frequently mutated label data can be kept on fast local NVMe SSDs. This also allows data to be partitioned across databases.
  • (Work in progress) A newer storage backend (DAGStore) will allow "chained storage" such that data published at a particular version, say on AWS Open Data, could be reused for later versions with only new modifications stored locally. This requires extending storage flexibility to versions of data across storage locations. DAGStore will greatly simplify "pull requests" where just the changes within a set of versions are transmitted between separate DVID servers.
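The "no checkout" property in the list above can be illustrated with a toy in-memory model. This is a minimal sketch, not DVID's implementation: the class `VersionedKV`, its method names, and the human-readable version names are all hypothetical, and the model is restricted to a tree of versions (no merges). Each version stores only its own changes, and a read at any version walks back through its ancestors until it finds the key.

```python
from __future__ import annotations


class VersionedKV:
    """Toy branched, versioned key-value store (hypothetical, in-memory only)."""

    def __init__(self) -> None:
        # Every version records its parent and only the keys written at that version.
        self._parent: dict[str, str | None] = {"root": None}
        self._deltas: dict[str, dict[str, bytes]] = {"root": {}}
        self._locked: set[str] = set()

    def commit_and_branch(self, parent: str, child: str) -> None:
        """Lock `parent` so it becomes immutable, then open `child` for writes."""
        if parent not in self._parent:
            raise KeyError(f"unknown version: {parent}")
        if child in self._parent:
            raise ValueError(f"version already exists: {child}")
        self._locked.add(parent)
        self._parent[child] = parent
        self._deltas[child] = {}

    def put(self, version: str, key: str, value: bytes) -> None:
        """Write a key at an open (uncommitted) version."""
        if version not in self._parent:
            raise KeyError(f"unknown version: {version}")
        if version in self._locked:
            raise PermissionError(f"version is committed: {version}")
        self._deltas[version][key] = value

    def get(self, version: str, key: str) -> bytes | None:
        """Read a key at any version by walking toward the root; no checkout step."""
        if version not in self._parent:
            raise KeyError(f"unknown version: {version}")
        node: str | None = version
        while node is not None:
            delta = self._deltas[node]
            if key in delta:
                return delta[key]
            node = self._parent[node]
        return None
```

With this model, two branches created from the same committed parent can be read side by side: a key written on one branch is visible there, while the sibling branch and the parent still return the original value.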

While much of the effort has been focused on the needs of the Janelia FlyEM Team, DVID can be used as a general-purpose branched versioning file system that handles billions of files and terabytes of data by creating instances of the keyvalue datatype. Our team uses the keyvalue datatype for branched versioning of JSON, configuration, and other files using the simple key-value HTTP API.

DVID aspires to be a "github for large-scale scientific data" because a variety of interrelated data (like image volume, labels, annotations, skeletons, meshes, and JSON data) can be versioned together. DVID currently handles branched versioning of large-scale data but does not yet provide domain-specific diff tools to compare data across versions, which would be a necessary step for user-friendly pull requests and truly collaborative data editing.

Installation

Users should install DVID from the releases. The main branch of DVID may include breaking changes required by our research work.

Developers should consult the install README where our conda-based process is described.

DVID has been tested on MacOS X, Linux (Fedora 16, CentOS 6, Ubuntu), and Windows 10+ Bash Shell. It comes out-of-the-box with several embedded key-value databases (Badger, Basho's leveldb) for storage although you can configure other storage backends.

Before launching DVID, you'll have to create a configuration file describing ports, the types of storage engines, and where the data should be stored. Both simple and complex sample configuration files are provided in the scripts/distro-files directory.

Basic Usage

Some documentation on how to start the DVID server is available on the DVID wiki. While the wiki's User Guide provides simple console-based toy examples, our team's own use of the DVID services is much more complex because of our variety of clients and script-based usage. Please see the neuclease python library for more realistic ways to use DVID at scale and, in particular, for larger image volumes.

More Information

Both high-level and detailed descriptions of DVID and its ecosystem can be found here:

DVID is easily extensible by adding custom data types, each of which fulfills a minimal interface (e.g., HTTP request handling). Its initial focus is on efficiently handling data essential for Janelia's connectomics research:

  • image and 64-bit label 3d volumes, including multiscale support
  • 2d images in XY, XZ, YZ, and arbitrary orientation
  • multiscale 2d images in XY, XZ, and YZ, similar to quadtrees
  • sparse volumes, corresponding to each unique label in a volume, that can be merged or split
  • point annotations (e.g., synapse elements) that can be quickly accessed via subvolumes or labels
  • label graphs
  • regions of interest represented via a coarse subdivision of space using block indices
  • 2d and 3d image and label data using Google BrainMaps API and other cloud-based services
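The "coarse subdivision of space using block indices" mentioned for regions of interest can be sketched with integer arithmetic. This is an illustrative example only: the function names are hypothetical, and the block size of 32 voxels per side is an assumption, since the actual block size depends on how a data instance is configured.

```python
from __future__ import annotations

Coord = tuple[int, int, int]


def block_index(voxel: Coord, block_size: int = 32) -> Coord:
    """Map an (x, y, z) voxel coordinate to the index of the block containing it.

    Floor division keeps negative coordinates in the correct block.
    The default block size of 32 is an assumption, not a DVID constant.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    x, y, z = voxel
    return (x // block_size, y // block_size, z // block_size)


def blocks_covering(offset: Coord, size: Coord, block_size: int = 32) -> list[Coord]:
    """List every block index that intersects the subvolume at `offset` with extent `size`."""
    if any(s <= 0 for s in size):
        raise ValueError("size must be positive along every axis")
    first = block_index(offset, block_size)
    last_voxel = (offset[0] + size[0] - 1, offset[1] + size[1] - 1, offset[2] + size[2] - 1)
    last = block_index(last_voxel, block_size)
    # Iterate z outermost and x innermost so neighbouring x blocks are adjacent in the result.
    return [
        (bx, by, bz)
        for bz in range(first[2], last[2] + 1)
        for by in range(first[1], last[1] + 1)
        for bx in range(first[0], last[0] + 1)
    ]
```

A region of interest can then be stored as a set of such block indices rather than as a per-voxel mask, which is far smaller for large volumes.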

Each of the above is handled by built-in data types via a Level 2 REST HTTP API implemented by Go language packages within the datatype directory. When dealing with novel data, we typically use the generic keyvalue datatype and store JSON-encoded or binary data until we understand the desired access patterns and API. When we outgrow the keyvalue type's GET, POST, and DELETE operations, we create a custom datatype package with a specialized HTTP API.
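The GET, POST, and DELETE operations of the keyvalue type can be sketched with Python's standard library. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the server address `http://localhost:8000` is a placeholder, and the path layout `/api/node/<UUID>/<data name>/key/<key>` is an assumption that should be checked against the help text served by a running DVID server.

```python
from __future__ import annotations

import urllib.request
from urllib.parse import quote


def key_url(server: str, uuid: str, instance: str, key: str) -> str:
    """Build the URL for one key of a keyvalue instance at one version.

    The path layout is an assumption; confirm it against the server's own help.
    """
    parts = (quote(uuid, safe=""), quote(instance, safe=""), quote(key, safe=""))
    return f"{server.rstrip('/')}/api/node/{parts[0]}/{parts[1]}/key/{parts[2]}"


def build_request(
    method: str,
    server: str,
    uuid: str,
    instance: str,
    key: str,
    body: bytes | None = None,
) -> urllib.request.Request:
    """Prepare a GET, POST, or DELETE request without sending it."""
    if method not in {"GET", "POST", "DELETE"}:
        raise ValueError(f"unsupported method: {method}")
    return urllib.request.Request(
        key_url(server, uuid, instance, key), data=body, method=method
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> bytes:
    """Send a prepared request and return the raw response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

For example, `send(build_request("POST", "http://localhost:8000", uuid, "settings", "config.json", b"{}"))` would store a small JSON document at the version identified by `uuid`, assuming a keyvalue instance named `settings` (a hypothetical name) exists on that server.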

DVID allows you to assign different storage systems to data instances within a single repo, which allows great flexibility in optimizing storage for particular use cases. For example, easily compressed label data can be stored on fast, expensive SSDs while larger, immutable grayscale image data can be stored in petabyte-scale read-optimized systems.
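The idea of assigning storage per data instance can be modeled as a lookup with fallbacks. The sketch below is a toy illustration, not DVID's configuration format or code: the `Store` and `StorageRouter` classes, the store names, and the lookup order (instance first, then datatype, then a default) are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Store:
    """A hypothetical storage backend description."""

    name: str
    engine: str
    mutable: bool


class StorageRouter:
    """Pick a store for a data instance: instance rule, then datatype rule, then default."""

    def __init__(self, default: Store) -> None:
        self._default = default
        self._by_instance: dict[str, Store] = {}
        self._by_datatype: dict[str, Store] = {}

    def assign_instance(self, instance: str, store: Store) -> None:
        self._by_instance[instance] = store

    def assign_datatype(self, datatype: str, store: Store) -> None:
        self._by_datatype[datatype] = store

    def store_for(self, instance: str, datatype: str) -> Store:
        # The most specific rule wins.
        if instance in self._by_instance:
            return self._by_instance[instance]
        return self._by_datatype.get(datatype, self._default)
```

In this model, immutable grayscale data could be routed to a read-optimized store by datatype while a frequently edited label instance is pinned to a local SSD store by name.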

DVID is written in Go and supports pluggable storage backends, a REST HTTP API, and command-line access (likely to be minimized in the near future). Some components written in C, e.g., storage engines like leveldb and fast codecs like lz4, are embedded or linked as libraries.

Command-line and HTTP API documentation can be found in help constants within packages or by visiting the /api/help HTTP endpoint on a running DVID server.
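Fetching the help text from the `/api/help` endpoint can be scripted with the standard library. This is a small sketch: the function names are hypothetical, and the server address passed in (for example `http://localhost:8000`) is a placeholder for wherever a DVID server is actually running.

```python
import urllib.request


def help_url(server: str) -> str:
    """Return the URL of the help endpoint for a DVID server address."""
    return f"{server.rstrip('/')}/api/help"


def fetch_help(server: str, timeout: float = 10.0) -> str:
    """Download the help page and decode it as text."""
    with urllib.request.urlopen(help_url(server), timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_help("http://localhost:8000")` requires a reachable server; without one, `urlopen` raises `urllib.error.URLError`.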

Monitoring

Mutation and activity logs can be sent to a Kafka server. We use Kafka activity topics to feed Kibana for analyzing DVID performance.

Snapshot of Kibana web page for DVID metrics

Known Clients with DVID Support

Programmatic clients:

  • neuclease, python library from HHMI Janelia
  • intern, python library from Johns Hopkins APL
  • natverse, R library from Jefferis Lab
  • libdvid-cpp, C++ library from HHMI Janelia FlyEM

GUI clients:

Screenshot of an early web app prototype pulling neuron data and 2d slices from 3d grayscale data:

Web app for 3d inspection being served from and sending requests to DVID
