• This repository has been archived on 11/Dec/2023

A columnar data container that can be compressed.

Unmaintained Package Notice

Unfortunately, due to a lack of resources, the Blosc Development Team is unable to maintain this package anymore. Over the last 10 years we managed to find resources (even if quite irregularly) to develop what we think is a nice package for handling compressed data containers, especially tabular data. Regrettably, in recent years we have not found enough sponsorship to continue maintaining this package.

For those who depend on bcolz, a fork is welcome and we will try our best to provide advice for possible new maintainers. Indeed, if we manage to get some decent grants via Blosc (https://blosc.org/pages/donate/), our umbrella project, we would be glad to reconsider the maintenance of bcolz. But again, we would be very open to, and supportive of, this project getting a new maintenance team.

Finally, thanks to all the people who used and contributed in one way or another to bcolz; it has been a nice ride! Let's hope it still has a bright future ahead.

The Blosc Development Team

bcolz: columnar and compressed data containers

Join the chat at https://gitter.im/Blosc/bcolz

bcolz provides columnar, chunked data containers that can be compressed either in-memory or on-disk. Column storage allows for efficiently querying tables, as well as for cheap column addition and removal. It is based on NumPy, and uses it as the standard data container to communicate with bcolz objects, but it also comes with import/export facilities to/from HDF5/PyTables tables and pandas dataframes.

bcolz objects are compressed by default not only for reducing memory/disk storage, but also to improve I/O speed. The compression process is carried out internally by Blosc, a high-performance, multithreaded meta-compressor that is optimized for binary data (although it works with text data just fine too).

bcolz can also use numexpr internally (it does so by default if it detects that numexpr is installed) or dask to accelerate many vector and query operations (although it can also do so with pure NumPy). numexpr and dask can optimize memory usage and use multithreading for the computations, which makes these operations fast. This, in combination with the disk-based, compressed carray/ctable containers, can be used for performing out-of-core computations efficiently and, most importantly, transparently.
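The idea of a chunked, compressed container that is processed one chunk at a time can be sketched with the standard library alone. This is only an illustration under stated assumptions, not bcolz's implementation: `ChunkedColumn` and its attribute names are hypothetical, `zlib` stands in for Blosc, and the chunks live in memory rather than on disk.

```python
import zlib
from array import array


class ChunkedColumn:
    """Toy column of float64 values kept as independently compressed chunks.

    Hypothetical illustration only; this is not bcolz's carray.
    """

    def __init__(self, chunk_len=4096):
        self.chunk_len = chunk_len
        self.chunks = []            # zlib-compressed chunks (bytes objects)
        self.pending = array("d")   # tail values that do not fill a chunk yet

    def extend(self, values):
        self.pending.extend(values)
        # Compress every complete chunk; only the remainder stays uncompressed.
        while len(self.pending) >= self.chunk_len:
            head = self.pending[: self.chunk_len]
            self.chunks.append(zlib.compress(head.tobytes()))
            self.pending = self.pending[self.chunk_len:]

    def iter_chunks(self):
        # Decompress one chunk at a time instead of the whole column.
        for blob in self.chunks:
            chunk = array("d")
            chunk.frombytes(zlib.decompress(blob))
            yield chunk
        if self.pending:
            yield self.pending

    def __len__(self):
        return len(self.chunks) * self.chunk_len + len(self.pending)

    @property
    def nbytes(self):
        # Size the data would take if stored uncompressed.
        return len(self) * self.pending.itemsize

    @property
    def cbytes(self):
        # Size actually held: compressed chunks plus the uncompressed tail.
        tail = len(self.pending) * self.pending.itemsize
        return sum(map(len, self.chunks)) + tail


col = ChunkedColumn()
col.extend(float(i % 100) for i in range(100_000))

# Reduce chunk by chunk; the full column is never decompressed at once.
total = sum(sum(chunk) for chunk in col.iter_chunks())
print(f"{len(col)} values, {col.nbytes} raw bytes, "
      f"{col.cbytes} bytes held, sum={total}")
```

Because each chunk is compressed independently, a reduction like the sum above only needs one decompressed chunk at a time, which is the property that makes out-of-core processing possible.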

Just to whet your appetite, here is an example with real data, where bcolz is already fulfilling the promise of accelerating memory I/O by using compression:

http://nbviewer.ipython.org/github/Blosc/movielens-bench/blob/master/querying-ep14.ipynb

Rationale

By using compression, you can deal with more data using the same amount of memory, which is very good in itself. But in case you are wondering about the price to pay in terms of performance, you should know that nowadays memory access is the most common bottleneck in many computational scenarios, and that CPUs spend most of their time waiting for data. Hence, having data compressed in memory can reduce the stress on the memory subsystem as well.

Furthermore, columnar means that tabular datasets are stored in column-wise order, and this turns out to offer better opportunities to improve the compression ratio. This is because data tends to expose more similarity among elements that sit in the same column than among those in the same row, so compressors generally do a much better job when data is laid out in column-wise order. In addition, when you have to deal with tables with a large number of columns and your operations only involve some of them, columnar storage tends to be much more effective because it minimizes the amount of data that travels to the CPU caches.
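The effect of layout on compression ratio can be checked with a small experiment. The following is a sketch on synthetic data, with `zlib` standing in for Blosc and with made-up column names (`region`, `reading`); it is not a benchmark of bcolz. The same two columns are serialized row-wise and column-wise, and both byte strings are compressed.

```python
import random
import struct
import zlib

rng = random.Random(42)  # fixed seed so the run is repeatable
n_rows = 20_000

# Column 1: a low-cardinality code that stays constant over long runs.
# Column 2: a noisy 64-bit reading that is essentially incompressible.
region = [i // 1000 for i in range(n_rows)]
reading = [rng.getrandbits(64) for _ in range(n_rows)]

# Row-wise layout: the two fields of each record sit next to each other.
row_wise = b"".join(struct.pack("<iQ", r, v) for r, v in zip(region, reading))

# Column-wise layout: all values of a column are contiguous.
col_wise = (struct.pack(f"<{n_rows}i", *region)
            + struct.pack(f"<{n_rows}Q", *reading))

row_size = len(zlib.compress(row_wise))
col_size = len(zlib.compress(col_wise))

print(f"raw size:               {len(row_wise)} bytes")
print(f"row-wise compressed:    {row_size} bytes")
print(f"column-wise compressed: {col_size} bytes")
```

In the column-wise layout the long runs in `region` are contiguous and compress to almost nothing, whereas in the row-wise layout they are interrupted every few bytes by the noisy `reading` field. How large the gap is depends entirely on the data, so the numbers printed here should not be read as typical.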

So, the ultimate goal for bcolz is not only reducing the memory needs of large arrays/tables, but also making bcolz operations run faster than those on a traditional data container like the ones in NumPy or pandas. That is actually already the case in some real-life scenarios (see the notebook above), but it will become much more noticeable in combination with forthcoming, faster CPUs integrating more cores and wider vector units.

Requisites

  • Python 2.7 or >= 3.5
  • NumPy >= 1.8
  • Cython >= 0.22 (just for compiling the beast)
  • C-Blosc >= 1.8.0 (optional, as the internal Blosc will be used by default)

Optional:

  • numexpr >= 2.5.2
  • dask >= 0.9.0
  • pandas
  • tables (pytables)

Building

There are different ways to compile bcolz, depending on whether you want to link with an already installed Blosc library or not.

Compiling with an installed Blosc library (recommended)

Python and Blosc-powered extensions have a difficult relationship when compiled using GCC, which is why using an external C-Blosc library is recommended for maximum performance (for details, see Blosc/python-blosc#110).

Go to https://github.com/Blosc/c-blosc/releases, then download and install the C-Blosc library. Then, you can tell bcolz where the C-Blosc library is in a couple of ways:

Using an environment variable:

$ BLOSC_DIR=/usr/local     (or "set BLOSC_DIR=\blosc" on Win)
$ export BLOSC_DIR         (not needed on Win)
$ python setup.py build_ext --inplace

Using a flag:

$ python setup.py build_ext --inplace --blosc=/usr/local

Compiling without an installed Blosc library

bcolz also bundles the Blosc sources so, assuming that you have a C++ compiler installed, do:

$ python setup.py build_ext --inplace

That's all. You can now proceed to the Testing section.

Note: The C++ compiler is required only for the Snappy dependency. The other components of Blosc are pure C (including the LZ4 and Zlib libraries).

Testing

After compiling, you can quickly check that the package is sane by running:

$ PYTHONPATH=.   (or "set PYTHONPATH=." on Windows)
$ export PYTHONPATH    (not needed on Windows)
$ python -c"import bcolz; bcolz.test()"  # add `heavy=True` if desired

Installing

Install it as a typical Python package:

$ pip install -U .

Optionally, install the additional dependencies:

$ pip install .[optional]

Documentation

You can find the online manual at:

http://bcolz.blosc.org

but of course, you can always access docstrings from the console (e.g. help(bcolz.ctable)).

Also, you may want to look at the bench/ directory for some examples of use.

Resources

Visit the main bcolz site repository at: http://github.com/Blosc/bcolz

Home of Blosc compressor: http://blosc.org

Users' mailing list: http://groups.google.com/group/bcolz

An introductory talk (20 min) about bcolz at EuroPython 2014. Slides here.

License

Please see BCOLZ.txt in the LICENSES/ directory.

Share your experience

Let us know of any bugs, suggestions, gripes, kudos, etc. you may have.

Enjoy Data!
