Python-Blosc

A Python wrapper for the extremely fast Blosc compression library

Author: The Blosc development team
Contact: [email protected]
GitHub: https://github.com/Blosc/python-blosc
URL: https://www.blosc.org/python-blosc/python-blosc.html
Code of Conduct: Contributor Covenant

What it is

Blosc (https://blosc.org) is a high performance compressor optimized for binary data. It has been designed to transmit data to the processor cache faster than the traditional, non-compressed, direct memory fetch approach via a memcpy() OS call.

Blosc works well for compressing numerical arrays that contain data with relatively low entropy, such as sparse data, time series, and grids with regularly spaced values.
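To get a feel for why regular numerical data compresses so well once a shuffle filter is applied, here is a minimal sketch that uses only the Python standard library. It is not how Blosc is implemented (Blosc shuffles block by block with optimized native code); the helper names `byte_shuffle` and `byte_unshuffle` are hypothetical, and `zlib` merely stands in for a Blosc codec.

```python
import struct
import zlib


def byte_shuffle(buf: bytes, typesize: int) -> bytes:
    """Group the i-th byte of every element together (a byte transpose)."""
    if len(buf) % typesize:
        raise ValueError("buffer length must be a multiple of typesize")
    return b"".join(buf[i::typesize] for i in range(typesize))


def byte_unshuffle(buf: bytes, typesize: int) -> bytes:
    """Invert byte_shuffle()."""
    n = len(buf) // typesize
    out = bytearray(len(buf))
    for i in range(typesize):
        out[i::typesize] = buf[i * n:(i + 1) * n]
    return bytes(out)


# A regularly spaced sequence stored as little-endian int64 values.
values = range(0, 200_000, 2)
raw = struct.pack(f"<{len(values)}q", *values)
shuffled = byte_shuffle(raw, 8)

# Compress both layouts with the same stand-in codec and compare sizes.
plain_size = len(zlib.compress(raw, 1))
shuffled_size = len(zlib.compress(shuffled, 1))
print(f"raw: {len(raw)} bytes, plain: {plain_size}, shuffled: {shuffled_size}")
```

After shuffling, the high-order bytes of neighbouring values (mostly zeros here) sit next to each other, so the codec sees long runs instead of one changing byte in every eight.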

python-blosc is a Python package that wraps Blosc. It supports Python 3.8 or higher.

Installing

python-blosc provides binary wheels for the main operating systems (Windows, macOS and Linux) and platforms. You can install them from PyPI using pip:

$ pip install blosc
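Once the package is installed, a quick round trip is an easy way to confirm that it works. The sketch below is only an illustration: the helper name `blosc_roundtrip_ok` is hypothetical, and it returns None instead of failing when the `blosc` module cannot be found.

```python
import importlib.util
from typing import Optional


def blosc_roundtrip_ok() -> Optional[bool]:
    """Compress and decompress a small buffer; None means blosc is not installed."""
    if importlib.util.find_spec("blosc") is None:
        return None
    import blosc

    payload = bytes(range(256)) * 64
    # typesize is the size in bytes of one element of the data (1 for raw bytes).
    packed = blosc.compress(payload, typesize=1)
    return blosc.decompress(packed) == payload


print(blosc_roundtrip_ok())
```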

Documentation

The Sphinx-based documentation is here:

https://blosc.org/python-blosc/python-blosc.html

Also, some examples are available on the python-blosc wiki page:

https://github.com/blosc/python-blosc/wiki

Lastly, a recording and the slides of the talk "Compress me stupid", given at EuroPython 2014, are also available.

Building

If you need more control, there are different ways to compile python-blosc, depending on whether you want to link with an already installed Blosc library or not.

Installing via setuptools

python-blosc bundles the Blosc sources and can be built with:

$ python -m pip install -r requirements-dev.txt
$ python setup.py build_ext --inplace

Any codec can be enabled (=1) or disabled (=0) in this build path with the corresponding environment variable: INCLUDE_LZ4, INCLUDE_SNAPPY, INCLUDE_ZLIB, or INCLUDE_ZSTD. By default all the codecs in Blosc are enabled except Snappy (due to some issues with C++ with the gcc toolchain).

Compiler-specific optimisations are enabled automatically by inspecting the CPU flags when building Blosc. They can be disabled manually by setting the following environment variables: DISABLE_BLOSC_SSE2 and DISABLE_BLOSC_AVX2.

setuptools is limited to using the compiler specified in the environment variable CC, which on POSIX systems is usually gcc. This often causes trouble with the Snappy codec, which is written in C++, and as a result Snappy is no longer compiled by default. This problem is not known to affect MSVC or clang. Snappy is considered optional in Blosc, as its compression performance is below that of the other codecs.
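If you script your builds, the variables above can be assembled programmatically. The following is a sketch only: `build_env` is a hypothetical helper, and using "1" as the value of the DISABLE_BLOSC_* variables is an assumption (the text above only says they need to be set).

```python
import os
from typing import Dict, Iterable, Mapping, Optional

CODEC_FLAGS = ("INCLUDE_LZ4", "INCLUDE_SNAPPY", "INCLUDE_ZLIB", "INCLUDE_ZSTD")
SIMD_FLAGS = ("DISABLE_BLOSC_SSE2", "DISABLE_BLOSC_AVX2")


def build_env(
    codecs: Mapping[str, bool],
    disable_simd: Iterable[str] = (),
    base: Optional[Mapping[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with the build variables filled in."""
    env = dict(os.environ if base is None else base)
    for name, enabled in codecs.items():
        flag = f"INCLUDE_{name.upper()}"
        if flag not in CODEC_FLAGS:
            raise ValueError(f"unknown codec: {name}")
        env[flag] = "1" if enabled else "0"
    for name in disable_simd:
        flag = f"DISABLE_BLOSC_{name.upper()}"
        if flag not in SIMD_FLAGS:
            raise ValueError(f"unknown SIMD flag: {name}")
        env[flag] = "1"  # assumed value; the variable only needs to be set
    return env


# Enable Snappy, disable Zstd and turn off the AVX2 code paths.
env = build_env({"snappy": True, "zstd": False}, disable_simd=["avx2"], base={})
print(env)

# To run the build with these settings (start from os.environ, i.e. base=None):
#   subprocess.run([sys.executable, "setup.py", "build_ext", "--inplace"],
#                  env=build_env({"snappy": True}), check=True)
```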

That's all. You can now proceed to the Testing section.

Compiling with an installed Blosc library

This approach uses pre-built, fully optimized versions of Blosc built via CMake.

Go to https://github.com/Blosc/c-blosc/releases, then download and install the C-Blosc library. After that, you can tell python-blosc where the C-Blosc library is in a couple of ways:

Using an environment variable:

$ export USE_SYSTEM_BLOSC=1                 # or "set USE_SYSTEM_BLOSC=1" on Windows
$ export Blosc_ROOT=/usr/local/customprefix # If you installed Blosc into a custom location
$ python setup.py build_ext --inplace

Using flags:

$ python setup.py build_ext --inplace -DUSE_SYSTEM_BLOSC:BOOL=YES -DBlosc_ROOT:PATH=/usr/local/customprefix
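Whichever build path you choose, it can be useful to check which C-Blosc version and which codecs the resulting module exposes. A minimal sketch, assuming the build is importable as `blosc`; the helper name `describe_blosc_build` is hypothetical, and it returns None when the module is not found.

```python
import importlib.util
from typing import Any, Dict, Optional


def describe_blosc_build() -> Optional[Dict[str, Any]]:
    """Summarize the importable blosc build, or return None if there is none."""
    if importlib.util.find_spec("blosc") is None:
        return None
    import blosc

    return {
        "python_blosc": blosc.__version__,
        "c_blosc": blosc.blosclib_version,
        "codecs": blosc.compressor_list(),
    }


print(describe_blosc_build())
```

If a codec you enabled at build time is missing from the list, the build most likely did not pick up your settings.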

Testing

After compiling, you can quickly check that the package is sane by running the doctests in blosc/test.py:

$ python -m blosc.test  # add -v for verbose mode

Once installed, you can re-run the tests at any time with:

$ python -c "import blosc; blosc.test()"

Benchmarking

If curious, you may want to run a small benchmark that compares a plain NumPy array copy against compression through different compressors in your Blosc build:

$ PYTHONPATH=. python bench/compress_ptr.py

Just to whet your appetite, here are the results for an Intel Xeon E5-2695 v3 @ 2.30GHz, running Python 3.5, CentOS 7, but YMMV (and will vary!):

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
python-blosc version: 1.5.1.dev0
Blosc version: 1.11.2 ($Date:: 2017-01-27 #$)
Compressors available: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd']
Compressor library versions:
  BloscLZ: 1.0.5
  LZ4: 1.7.5
  Snappy: 1.1.1
  Zlib: 1.2.7
  Zstd: 1.1.2
Python version: 3.5.2 |Continuum Analytics, Inc.| (default, Jul  2 2016, 17:53:06)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
Platform: Linux-3.10.0-327.18.2.el7.x86_64-x86_64 (#1 SMP Thu May 12 11:03:55 UTC 2016)
Linux dist: CentOS Linux 7.2.1511
Processor: x86_64
Byte-ordering: little
Detected cores: 56
Number of threads to use by default: 4
  -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Creating NumPy arrays with 10**8 int64/float64 elements:
  *** ctypes.memmove() *** Time for memcpy(): 0.276 s (2.70 GB/s)

Times for compressing/decompressing with clevel=5 and 24 threads

*** the arange linear distribution ***
  *** blosclz , noshuffle  ***  0.382 s (1.95 GB/s) / 0.300 s (2.48 GB/s)     Compr. ratio:   1.0x
  *** blosclz , shuffle    ***  0.042 s (17.77 GB/s) / 0.027 s (27.18 GB/s)   Compr. ratio:  57.1x
  *** blosclz , bitshuffle ***  0.094 s (7.94 GB/s) / 0.041 s (18.28 GB/s)    Compr. ratio:  74.0x
  *** lz4     , noshuffle  ***  0.156 s (4.79 GB/s) / 0.052 s (14.30 GB/s)    Compr. ratio:   2.0x
  *** lz4     , shuffle    ***  0.033 s (22.58 GB/s) / 0.034 s (22.03 GB/s)   Compr. ratio:  68.6x
  *** lz4     , bitshuffle ***  0.059 s (12.63 GB/s) / 0.053 s (14.18 GB/s)   Compr. ratio:  33.1x
  *** lz4hc   , noshuffle  ***  0.443 s (1.68 GB/s) / 0.070 s (10.62 GB/s)    Compr. ratio:   2.0x
  *** lz4hc   , shuffle    ***  0.102 s (7.31 GB/s) / 0.029 s (25.42 GB/s)    Compr. ratio:  97.5x
  *** lz4hc   , bitshuffle ***  0.206 s (3.62 GB/s) / 0.038 s (19.85 GB/s)    Compr. ratio: 180.5x
  *** snappy  , noshuffle  ***  0.154 s (4.84 GB/s) / 0.056 s (13.28 GB/s)    Compr. ratio:   2.0x
  *** snappy  , shuffle    ***  0.044 s (16.89 GB/s) / 0.047 s (15.95 GB/s)   Compr. ratio:  17.4x
  *** snappy  , bitshuffle ***  0.064 s (11.58 GB/s) / 0.061 s (12.26 GB/s)   Compr. ratio:  18.2x
  *** zlib    , noshuffle  ***  1.172 s (0.64 GB/s) / 0.135 s (5.50 GB/s)     Compr. ratio:   5.3x
  *** zlib    , shuffle    ***  0.260 s (2.86 GB/s) / 0.086 s (8.67 GB/s)     Compr. ratio: 120.8x
  *** zlib    , bitshuffle ***  0.262 s (2.84 GB/s) / 0.094 s (7.96 GB/s)     Compr. ratio: 260.1x
  *** zstd    , noshuffle  ***  0.973 s (0.77 GB/s) / 0.093 s (8.00 GB/s)     Compr. ratio:   7.8x
  *** zstd    , shuffle    ***  0.093 s (7.97 GB/s) / 0.023 s (32.71 GB/s)    Compr. ratio: 156.7x
  *** zstd    , bitshuffle ***  0.115 s (6.46 GB/s) / 0.029 s (25.60 GB/s)    Compr. ratio: 320.6x

*** the linspace linear distribution ***
  *** blosclz , noshuffle  ***  0.341 s (2.19 GB/s) / 0.291 s (2.56 GB/s)     Compr. ratio:   1.0x
  *** blosclz , shuffle    ***  0.132 s (5.65 GB/s) / 0.023 s (33.10 GB/s)    Compr. ratio:   2.0x
  *** blosclz , bitshuffle ***  0.166 s (4.50 GB/s) / 0.036 s (20.89 GB/s)    Compr. ratio:   2.8x
  *** lz4     , noshuffle  ***  0.142 s (5.26 GB/s) / 0.028 s (27.07 GB/s)    Compr. ratio:   1.0x
  *** lz4     , shuffle    ***  0.093 s (8.01 GB/s) / 0.030 s (24.87 GB/s)    Compr. ratio:   3.4x
  *** lz4     , bitshuffle ***  0.102 s (7.31 GB/s) / 0.039 s (19.13 GB/s)    Compr. ratio:   5.3x
  *** lz4hc   , noshuffle  ***  0.700 s (1.06 GB/s) / 0.044 s (16.77 GB/s)    Compr. ratio:   1.1x
  *** lz4hc   , shuffle    ***  0.203 s (3.67 GB/s) / 0.021 s (36.22 GB/s)    Compr. ratio:   8.6x
  *** lz4hc   , bitshuffle ***  0.342 s (2.18 GB/s) / 0.028 s (26.50 GB/s)    Compr. ratio:  14.2x
  *** snappy  , noshuffle  ***  0.271 s (2.75 GB/s) / 0.274 s (2.72 GB/s)     Compr. ratio:   1.0x
  *** snappy  , shuffle    ***  0.099 s (7.54 GB/s) / 0.042 s (17.55 GB/s)    Compr. ratio:   4.2x
  *** snappy  , bitshuffle ***  0.127 s (5.86 GB/s) / 0.043 s (17.20 GB/s)    Compr. ratio:   6.1x
  *** zlib    , noshuffle  ***  1.525 s (0.49 GB/s) / 0.158 s (4.70 GB/s)     Compr. ratio:   1.6x
  *** zlib    , shuffle    ***  0.346 s (2.15 GB/s) / 0.098 s (7.59 GB/s)     Compr. ratio:  10.7x
  *** zlib    , bitshuffle ***  0.420 s (1.78 GB/s) / 0.104 s (7.20 GB/s)     Compr. ratio:  18.0x
  *** zstd    , noshuffle  ***  1.061 s (0.70 GB/s) / 0.096 s (7.79 GB/s)     Compr. ratio:   1.9x
  *** zstd    , shuffle    ***  0.203 s (3.68 GB/s) / 0.052 s (14.21 GB/s)    Compr. ratio:  14.2x
  *** zstd    , bitshuffle ***  0.251 s (2.97 GB/s) / 0.047 s (15.84 GB/s)    Compr. ratio:  22.2x

*** the random distribution ***
  *** blosclz , noshuffle  ***  0.340 s (2.19 GB/s) / 0.285 s (2.61 GB/s)     Compr. ratio:   1.0x
  *** blosclz , shuffle    ***  0.091 s (8.21 GB/s) / 0.017 s (44.29 GB/s)    Compr. ratio:   3.9x
  *** blosclz , bitshuffle ***  0.080 s (9.27 GB/s) / 0.029 s (26.12 GB/s)    Compr. ratio:   6.1x
  *** lz4     , noshuffle  ***  0.150 s (4.95 GB/s) / 0.027 s (28.05 GB/s)    Compr. ratio:   2.4x
  *** lz4     , shuffle    ***  0.068 s (11.02 GB/s) / 0.029 s (26.03 GB/s)   Compr. ratio:   4.5x
  *** lz4     , bitshuffle ***  0.063 s (11.87 GB/s) / 0.054 s (13.70 GB/s)   Compr. ratio:   6.2x
  *** lz4hc   , noshuffle  ***  0.645 s (1.15 GB/s) / 0.019 s (39.22 GB/s)    Compr. ratio:   3.5x
  *** lz4hc   , shuffle    ***  0.257 s (2.90 GB/s) / 0.022 s (34.62 GB/s)    Compr. ratio:   5.1x
  *** lz4hc   , bitshuffle ***  0.128 s (5.80 GB/s) / 0.029 s (25.52 GB/s)    Compr. ratio:   6.2x
  *** snappy  , noshuffle  ***  0.164 s (4.54 GB/s) / 0.048 s (15.46 GB/s)    Compr. ratio:   2.2x
  *** snappy  , shuffle    ***  0.082 s (9.09 GB/s) / 0.043 s (17.39 GB/s)    Compr. ratio:   4.3x
  *** snappy  , bitshuffle ***  0.071 s (10.48 GB/s) / 0.046 s (16.08 GB/s)   Compr. ratio:   5.0x
  *** zlib    , noshuffle  ***  1.223 s (0.61 GB/s) / 0.093 s (7.97 GB/s)     Compr. ratio:   4.0x
  *** zlib    , shuffle    ***  0.636 s (1.17 GB/s) / 0.126 s (5.89 GB/s)     Compr. ratio:   5.5x
  *** zlib    , bitshuffle ***  0.327 s (2.28 GB/s) / 0.109 s (6.81 GB/s)     Compr. ratio:   6.2x
  *** zstd    , noshuffle  ***  1.432 s (0.52 GB/s) / 0.103 s (7.27 GB/s)     Compr. ratio:   4.2x
  *** zstd    , shuffle    ***  0.388 s (1.92 GB/s) / 0.031 s (23.71 GB/s)    Compr. ratio:   5.9x
  *** zstd    , bitshuffle ***  0.127 s (5.86 GB/s) / 0.033 s (22.77 GB/s)    Compr. ratio:   6.4x

Also, Blosc works quite well on ARM processors (even without NEON support yet):

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
python-blosc version: 1.4.4
Blosc version: 1.11.2 ($Date:: 2017-01-27 #$)
Compressors available: ['blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd']
Compressor library versions:
  BloscLZ: 1.0.5
  LZ4: 1.7.5
  Snappy: 1.1.1
  Zlib: 1.2.8
  Zstd: 1.1.2
Python version: 3.6.0 (default, Dec 31 2016, 21:20:16)
[GCC 4.9.2]
Platform: Linux-3.4.113-sun8i-armv7l (#50 SMP PREEMPT Mon Nov 14 08:41:55 CET 2016)
Linux dist: debian 9.0
Processor: not recognized
Byte-ordering: little
Detected cores: 4
Number of threads to use by default: 4
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
  *** ctypes.memmove() *** Time for memcpy():   0.015 s (93.57 MB/s)

Times for compressing/decompressing with clevel=5 and 4 threads

*** user input ***
  *** blosclz , noshuffle  ***  0.015 s (89.93 MB/s) / 0.010 s (138.32 MB/s)    Compr. ratio:   2.7x
  *** blosclz , shuffle    ***  0.023 s (60.25 MB/s) / 0.012 s (112.71 MB/s)    Compr. ratio:   2.3x
  *** blosclz , bitshuffle ***  0.018 s (77.63 MB/s) / 0.021 s (66.76 MB/s)     Compr. ratio:   7.3x
  *** lz4     , noshuffle  ***  0.008 s (177.14 MB/s) / 0.009 s (159.00 MB/s)   Compr. ratio:   3.6x
  *** lz4     , shuffle    ***  0.010 s (131.29 MB/s) / 0.012 s (117.69 MB/s)   Compr. ratio:   3.5x
  *** lz4     , bitshuffle ***  0.015 s (89.97 MB/s) / 0.022 s (63.62 MB/s)     Compr. ratio:   8.4x
  *** lz4hc   , noshuffle  ***  0.071 s (19.30 MB/s) / 0.007 s (186.64 MB/s)    Compr. ratio:   8.6x
  *** lz4hc   , shuffle    ***  0.079 s (17.30 MB/s) / 0.014 s (95.99 MB/s)     Compr. ratio:   6.2x
  *** lz4hc   , bitshuffle ***  0.062 s (22.23 MB/s) / 0.027 s (51.53 MB/s)     Compr. ratio:   9.7x
  *** snappy  , noshuffle  ***  0.008 s (173.87 MB/s) / 0.009 s (148.77 MB/s)   Compr. ratio:   4.4x
  *** snappy  , shuffle    ***  0.011 s (123.22 MB/s) / 0.016 s (85.16 MB/s)    Compr. ratio:   4.4x
  *** snappy  , bitshuffle ***  0.015 s (89.02 MB/s) / 0.021 s (64.87 MB/s)     Compr. ratio:   6.2x
  *** zlib    , noshuffle  ***  0.047 s (29.26 MB/s) / 0.011 s (121.83 MB/s)    Compr. ratio:  14.7x
  *** zlib    , shuffle    ***  0.080 s (17.20 MB/s) / 0.022 s (63.61 MB/s)     Compr. ratio:   9.4x
  *** zlib    , bitshuffle ***  0.059 s (23.50 MB/s) / 0.033 s (41.10 MB/s)     Compr. ratio:  10.5x
  *** zstd    , noshuffle  ***  0.113 s (12.21 MB/s) / 0.011 s (124.64 MB/s)    Compr. ratio:  15.6x
  *** zstd    , shuffle    ***  0.154 s (8.92 MB/s) / 0.026 s (52.56 MB/s)      Compr. ratio:   9.9x
  *** zstd    , bitshuffle ***  0.116 s (11.86 MB/s) / 0.036 s (38.40 MB/s)     Compr. ratio:  11.4x

For details on the ARM benchmark see: #105

In case you find your own results interesting, please report them back to the authors!
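When comparing your own numbers with the Xeon run above, it helps to know how the throughput column relates to the timings. The sketch below assumes two things that are inferred from the figures rather than stated in the benchmark output: each array holds 10**8 elements of 8 bytes, and "GB/s" means 2**30 bytes per second. The function name `throughput_gib_per_s` is hypothetical.

```python
def throughput_gib_per_s(n_elements: int, itemsize: int, seconds: float) -> float:
    """Bytes processed per second, expressed in units of 2**30 bytes."""
    return n_elements * itemsize / seconds / 2**30


# The memcpy() row of the Xeon run: 10**8 int64 elements copied in 0.276 s.
memcpy_rate = throughput_gib_per_s(10**8, 8, 0.276)
print(f"memcpy: {memcpy_rate:.2f} GB/s")  # the benchmark reports 2.70 GB/s for this row
```

Because the printed times are rounded to milliseconds, recomputing the faster rows this way only approximately reproduces the listed rates.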

License

The software is licensed under a 3-Clause BSD license. A copy of the python-blosc license can be found in LICENSE.txt.

Mailing list

Discussion about this module is welcome in the Blosc list:

[email protected]

https://groups.google.com/g/blosc


Enjoy data!
