  • Language
    Python
  • License
    BSD 3-Clause "New...
  • Created about 9 years ago
  • Updated 5 months ago

Repository Details

Concurrent appendable key-value storage

PartD

Key-value byte store with appendable values

Partd stores key-value pairs. Values are raw bytes. New values are appended to existing ones.

Partd excels at shuffling operations.

Operations

Partd has two main operations: append and get.

Example

  1. Create a Partd backed by a directory:

    >>> import partd
    >>> p = partd.File('/path/to/new/dataset/')
    
  2. Append key-byte pairs to dataset:

    >>> p.append({'x': b'Hello ', 'y': b'123'})
    >>> p.append({'x': b'world!', 'y': b'456'})
    
  3. Get the bytes associated with keys:

    >>> p.get('x')         # One key
    b'Hello world!'
    
    >>> p.get(['y', 'x'])  # List of keys
    [b'123456', b'Hello world!']
    
  4. Destroy partd dataset:

    >>> p.drop()
    

That's it.
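
To make the append/get semantics concrete, here is a toy in-memory model written with the standard library only. It is a sketch, not partd's implementation: the class name `ToyStore` is hypothetical, and returning empty bytes for a missing key is an assumption of this sketch.

```python
from collections import defaultdict


class ToyStore:
    """Toy in-memory model of append/get semantics (not partd's code)."""

    def __init__(self):
        # Every key starts out as empty bytes
        self._data = defaultdict(bytes)

    def append(self, data):
        # Concatenate each new value onto whatever is already stored
        for key, value in data.items():
            self._data[key] += value

    def get(self, keys):
        # A single key returns bytes; a list of keys returns a list of bytes.
        # Assumption of this sketch: missing keys yield b''.
        if isinstance(keys, list):
            return [self._data.get(k, b'') for k in keys]
        return self._data.get(keys, b'')

    def drop(self):
        # Forget everything that was stored
        self._data.clear()


store = ToyStore()
store.append({'x': b'Hello ', 'y': b'123'})
store.append({'x': b'world!', 'y': b'456'})
print(store.get('x'))         # b'Hello world!'
print(store.get(['y', 'x']))  # [b'123456', b'Hello world!']
```

Because values only ever grow by concatenation, writers never need to read the old value first, which is what makes this access pattern a good fit for shuffles.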

Implementations

We can back a partd by an in-memory dictionary:

>>> from partd import Dict, File, Buffer
>>> p = Dict()

For larger amounts of data, or to share data between processes, we back a partd by a directory of files. This uses file-based locks for consistency:

>>> p = File('/path/to/dataset/')

However, this can fail for many small writes. In these cases you may wish to buffer one partd with another, keeping a fixed maximum of data in the buffering partd. When space runs low, the larger elements of the first partd are written to the second:

>>> p = Buffer(Dict(), File(), available_memory=2e9)  # 2GB memory buffer
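
The spilling policy described above can be sketched with two plain dictionaries. This is a toy model under stated assumptions, not partd's `Buffer`: the class `ToyBuffer` and its `memory_usage` counter are hypothetical, and memory is measured simply as the number of buffered bytes.

```python
class ToyBuffer:
    """Toy model of buffering a fast store with a slow one (not partd's code)."""

    def __init__(self, fast, slow, available_memory):
        self.fast = fast    # in-memory dict acting as the buffer
        self.slow = slow    # dict standing in for a file-backed store
        self.available_memory = available_memory
        self.memory_usage = 0

    def append(self, data):
        for key, value in data.items():
            self.fast[key] = self.fast.get(key, b'') + value
            self.memory_usage += len(value)
        # Spill the largest buffered values until we are back under the limit
        while self.memory_usage > self.available_memory:
            key = max(self.fast, key=lambda k: len(self.fast[k]))
            value = self.fast.pop(key)
            self.slow[key] = self.slow.get(key, b'') + value
            self.memory_usage -= len(value)

    def get(self, key):
        # Spilled bytes were written earlier, so they come first
        return self.slow.get(key, b'') + self.fast.get(key, b'')


fast, slow = {}, {}
buf = ToyBuffer(fast, slow, available_memory=10)
buf.append({'a': b'12345678'})  # 8 bytes: fits in the buffer
buf.append({'b': b'xyz'})       # 11 bytes total: 'a' is largest and spills
buf.append({'a': b'9'})         # new bytes for 'a' are buffered again
print(buf.get('a'))             # b'123456789'
```

Spilling the largest values first turns many small appends into a few large writes to the slow store, which is the case the buffer is meant to help with.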

You might also want to have many distributed processes write to a single partd consistently. This can be done with a server:

  • Server Process:

    >>> p = Buffer(Dict(), File(), available_memory=2e9)  # 2GB memory buffer
    >>> s = Server(p, address='ipc://server')
    
  • Worker processes:

    >>> p = Client('ipc://server')  # Client machine talks to remote server
    

Encodings and Compression

Once we can robustly and efficiently append bytes to a partd we consider compression and encodings. This is generally available with the Encode partd, which accepts three functions: one to apply to bytes as they are written, one to apply to bytes as they are read, and one to join bytestreams. Common configurations already exist for common data and compression formats.
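
The role of the three functions can be illustrated with a small stdlib-only sketch. It is not partd's `Encode`: the names `ToyEncode`, `frame` and `framesplit` are hypothetical, and the length-prefix framing is an assumption made here so that independently encoded chunks can be separated again after they have been appended to one another.

```python
import pickle
import struct
import zlib


def frame(payload):
    # Prefix each chunk with its length as an 8-byte little-endian integer
    return struct.pack('<Q', len(payload)) + payload


def framesplit(raw):
    # Recover the individual chunks from a concatenation of framed chunks
    i = 0
    while i < len(raw):
        (n,) = struct.unpack_from('<Q', raw, i)
        yield raw[i + 8:i + 8 + n]
        i += 8 + n


class ToyEncode:
    """Toy wrapper applying encode on write, decode and join on read."""

    def __init__(self, encode, decode, join, store=None):
        self.encode, self.decode, self.join = encode, decode, join
        self.store = {} if store is None else store

    def append(self, data):
        # Encode each value on its own, then append the framed result
        for key, value in data.items():
            framed = frame(self.encode(value))
            self.store[key] = self.store.get(key, b'') + framed

    def get(self, key):
        # Decode every stored chunk, then join the decoded pieces
        raw = self.store.get(key, b'')
        return self.join([self.decode(c) for c in framesplit(raw)])


# Compression: the decoded pieces are bytes, so join is concatenation
p = ToyEncode(zlib.compress, zlib.decompress, b''.join)
p.append({'x': b'Hello '})
p.append({'x': b'world!'})
print(p.get('x'))  # b'Hello world!'

# Serialization: the decoded pieces are lists, so join flattens them
q = ToyEncode(pickle.dumps, pickle.loads, lambda parts: sum(parts, []))
q.append({'y': [1, 2]})
q.append({'y': [3]})
print(q.get('y'))  # [1, 2, 3]
```

The join function is separate because what "appending" means depends on the decoded type: bytes concatenate, while lists (or arrays, or dataframes) need their own way of being combined.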

We may wish to compress and decompress data transparently as we interact with a partd. Objects like BZ2, Blosc, ZLib and Snappy exist and take another partd as an argument:

>>> p = File(...)
>>> p = ZLib(p)

These work exactly as before; the (de)compression happens automatically.

Common data formats like Python lists, NumPy arrays, and pandas dataframes are also supported out of the box:

>>> p = File(...)
>>> p = NumPy(p)
>>> p.append({'x': np.array([...])})

This lets us forget about bytes and think in terms of our normal data types.

Composition

In principle we want to compose all of these choices together:

  1. Write policy: Dict, File, Buffer, Client
  2. Encoding: Pickle, NumPy, Pandas, ...
  3. Compression: Blosc, Snappy, ...

Partd objects compose by nesting. Here we make a partd that writes pickle-encoded, BZ2-compressed bytes directly to disk:

>>> p = Pickle(BZ2(File('foo')))

We could construct more complex systems that include compression, serialization, buffering, and remote access:

>>> server = Server(Buffer(Dict(), File(), available_memory=2e9))

>>> client = Pickle(Snappy(Client(server.address)))
>>> client.append({'x': [1, 2, 3]})
