  • Stars: 209
  • Rank: 188,325 (Top 4 %)
  • Language: Python
  • License: Apache License 2.0
  • Created over 5 years ago
  • Updated about 4 years ago

Repository Details

Efficient Data Loading Pipeline in Pure Python

Tensorpack DataFlow

Tensorpack DataFlow is an efficient and flexible data loading pipeline for deep learning, written in pure Python.

Its main features are:

  1. Highly optimized for speed. Parallelization in Python is hard, and most libraries do it wrong. DataFlow implements highly optimized parallel building blocks that give you an easy interface for parallelizing your workload.

  2. Written in pure Python. This allows it to be used together with any other Python-based library.
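To make the first point more concrete, here is a minimal standard-library sketch of the kind of building block described: a parallel map over a stream of datapoints. It is not DataFlow's implementation. The names `parallel_map` and `slow_transform` are hypothetical, and a thread pool is assumed here as a simple stand-in for the library's own parallel workers.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def slow_transform(x):
    """Hypothetical per-datapoint work (e.g. decoding or augmentation)."""
    time.sleep(0.01)  # simulate I/O-bound work, which threads can overlap
    return x * x


def parallel_map(func, datapoints, num_workers=4):
    """Apply func to each datapoint using a pool of worker threads.

    Results are yielded in the same order as the inputs.
    """
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # Executor.map submits every task up front and yields results in order.
        yield from pool.map(func, datapoints)


results = list(parallel_map(slow_transform, range(8)))
print(results)
```

Threads only help when the work is I/O-bound or releases the GIL. CPU-bound pure-Python transforms need separate processes, which brings serialization and inter-process communication costs; that is the part that is hard to get right.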

DataFlow was originally part of the tensorpack library and has been through many years of polishing. Because it is independent of the rest of tensorpack, it is now a separate library whose source code is synced with tensorpack. Please use tensorpack issues for support.

Why would you want to use DataFlow instead of a platform-specific data loading solution? We recommend reading Why DataFlow?.

Install:

pip install --upgrade git+https://github.com/tensorpack/dataflow.git
# or add `--user` to install to user's local directories

You may also need to install OpenCV, which is used by many built-in DataFlows.
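One quick way to check whether OpenCV is already available in the current environment is sketched below. It relies only on the standard library and on the fact that the OpenCV Python bindings are imported as `cv2`. The variable name `has_opencv` is just illustrative.

```python
import importlib.util

# The OpenCV Python bindings are imported under the module name `cv2`.
# find_spec returns None when a top-level module cannot be found.
has_opencv = importlib.util.find_spec("cv2") is not None
print("OpenCV available:", has_opencv)
```

If this reports `False`, installing the `opencv-python` package with pip is one common way to get the bindings.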

Examples:

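import dataflow as D
# some_transform and other_transform are placeholders for your own functions
d = D.ILSVRC12('/path/to/imagenet')  # produce [img, label]
# transform only component 0 (the image) of each datapoint
d = D.MapDataComponent(d, lambda img: some_transform(img), index=0)
# map_func receives the whole datapoint [img, label]; it is passed by keyword
# because a positional argument cannot follow num_proc=10
d = D.MultiProcessMapData(d, num_proc=10, map_func=lambda dp: other_transform(dp[0], dp[1]))
d = D.BatchData(d, 64)  # group 64 datapoints into one batch
d.reset_state()  # initialize the pipeline before iterating
for img, label in d:
  ...  # use the batch here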
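To show what such a pipeline does conceptually, here is a small sketch built from plain Python generators, with no dependency on DataFlow. The functions `source`, `map_component`, and `batch` are hypothetical stand-ins, not the library's classes. Batches here are plain lists rather than arrays, and dropping the incomplete final batch is a choice made for this sketch.

```python
from itertools import islice


def source(n):
    """Hypothetical stand-in for a dataset: yields [value, label] datapoints."""
    for i in range(n):
        yield [float(i), i % 2]


def map_component(datapoints, func, index):
    """Apply func to one component of each datapoint, leaving the rest unchanged."""
    for dp in datapoints:
        dp = list(dp)  # copy so the upstream datapoint is not mutated
        dp[index] = func(dp[index])
        yield dp


def batch(datapoints, batch_size):
    """Group datapoints into batches, one list per component.

    An incomplete final batch is dropped.
    """
    it = iter(datapoints)
    while True:
        chunk = list(islice(it, batch_size))
        if len(chunk) < batch_size:
            return
        # zip(*chunk) regroups [[v0, l0], [v1, l1]] into (v0, v1), (l0, l1).
        yield [list(component) for component in zip(*chunk)]


pipeline = batch(map_component(source(5), lambda v: v * 2, index=0), batch_size=2)
batches = list(pipeline)
print(batches)  # [[[0.0, 2.0], [0, 1]], [[4.0, 6.0], [0, 1]]]
```

Each stage wraps the previous one and pulls datapoints lazily, which is the same composition pattern as the wrapped `d = ...` assignments in the example above.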

Documentation:

Tutorials:

  1. Basics
  2. Why DataFlow?
  3. Write a DataFlow
  4. Parallel DataFlow
  5. Efficient DataFlow

APIs:

  1. Built-in DataFlows
  2. Built-in Datasets

Support & Contributing

Please send issues and pull requests (for the dataflow/ directory) to the tensorpack project where the source code is developed.