  • Language
    Python
  • License
    BSD 3-Clause "New...
  • Created over 9 years ago
  • Updated about 8 years ago

Repository Details

Partitioned storage system based on blosc. **No longer actively maintained.**

Castra

Castra is an on-disk, partitioned, compressed, column store. Castra provides efficient columnar range queries.

  • Efficient on-disk: Castra stores data on your hard drive in a way that lets you load it quickly, making inconveniently large data more comfortable to work with.
  • Partitioned: Castra partitions your data along an index, allowing rapid loads of ranges of data like "All records between January and March".
  • Compressed: Castra uses Blosc to compress data, increasing effective disk bandwidth and decreasing storage costs.
  • Column-store: Castra stores columns separately, drastically reducing I/O costs for analytic queries that touch only a few columns.
  • Tabular data: Castra plays well with Pandas and is an ideal fit for append-only applications like time-series.
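
To illustrate why partitioning along an index makes range loads cheap, here is a minimal sketch in plain Python. It is not Castra's implementation: the `partition_ends` list and the `partitions_for_range` helper are hypothetical names, and the sketch assumes each partition is identified by the largest index value it holds. A binary search over those values finds the few partitions a range query has to open.

```python
from bisect import bisect_left
from datetime import date

# Hypothetical partition index: the last index value stored in each
# partition, kept in sorted order (an assumption made for this sketch).
partition_ends = [date(2011, 1, 1), date(2013, 1, 1)]


def partitions_for_range(ends, start, stop):
    """Return positions of the partitions that may hold rows in [start, stop]."""
    # First partition whose last value is >= start.
    first = bisect_left(ends, start)
    # First partition whose last value is >= stop; nothing after it is needed.
    last = min(bisect_left(ends, stop), len(ends) - 1)
    return list(range(first, last + 1))


# A query inside 2012 only needs the second partition.
needed = partitions_for_range(partition_ends, date(2012, 6, 1), date(2012, 12, 31))
```

Only the partitions in `needed` would have to be read from disk; everything else is skipped without being opened.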

Maintenance

This project is no longer actively maintained. Use at your own risk.

Example

Consider some Pandas DataFrames:

In [1]: import pandas as pd
In [2]: A = pd.DataFrame({'price': [10.0, 11.0], 'volume': [100, 200]},
   ...:                  index=pd.DatetimeIndex(['2010', '2011']))

In [3]: B = pd.DataFrame({'price': [12.0, 13.0], 'volume': [300, 400]},
   ...:                  index=pd.DatetimeIndex(['2012', '2013']))

We create a Castra with a filename and a template dataframe, from which it takes column names, index type, and dtype information:

In [4]: from castra import Castra
In [5]: c = Castra('data.castra', template=A)

The castra starts empty but we can extend it with new dataframes:

In [6]: c.extend(A)

In [7]: c[:]
Out[7]:
            price  volume
2010-01-01     10     100
2011-01-01     11     200

In [8]: c.extend(B)

In [9]: c[:]
Out[9]:
            price  volume
2010-01-01     10     100
2011-01-01     11     200
2012-01-01     12     300
2013-01-01     13     400

We can select particular columns:

In [10]: c[:, 'price']
Out[10]:
2010-01-01    10
2011-01-01    11
2012-01-01    12
2013-01-01    13
Name: price, dtype: float64

Particular ranges:

In [12]: c['2011':'2013']
Out[12]:
            price  volume
2011-01-01     11     200
2012-01-01     12     300
2013-01-01     13     400

Or both:

In [13]: c['2011':'2013', 'volume']
Out[13]:
2011-01-01    200
2012-01-01    300
2013-01-01    400
Name: volume, dtype: int64

Storage

Castra stores your dataframes as they arrived. You can see the divisions along which your data is partitioned:

In [14]: c.partitions
Out[14]:
2011-01-01    2009-12-31T16:00:00.000000000-0800--2010-12-31...
2013-01-01    2011-12-31T16:00:00.000000000-0800--2012-12-31...
dtype: object

Each column in each partition lives in a separate compressed file:

$ ls -a data.castra/2011-12-31T16:00:00.000000000-0800--2012-12-31T16:00:00.000000000-0800
.  ..  .index  price  volume
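
The one-file-per-column layout can be sketched with the standard library alone. This is an illustration, not Castra's on-disk format: `zlib` stands in for Blosc, values are packed as little-endian float64, and the `write_partition` and `read_column` helpers and the directory name are hypothetical.

```python
import os
import struct
import tempfile
import zlib


def write_partition(root, name, columns):
    """Write each column of one partition to its own compressed file."""
    part_dir = os.path.join(root, name)
    os.makedirs(part_dir, exist_ok=True)
    for col, values in columns.items():
        # Pack the values as little-endian float64, then compress.
        raw = struct.pack(f"<{len(values)}d", *values)
        with open(os.path.join(part_dir, col), "wb") as f:
            f.write(zlib.compress(raw))


def read_column(root, name, col):
    """Read back a single column without touching the other columns' files."""
    with open(os.path.join(root, name, col), "rb") as f:
        raw = zlib.decompress(f.read())
    return list(struct.unpack(f"<{len(raw) // 8}d", raw))


root = tempfile.mkdtemp(suffix=".castra-sketch")
write_partition(root, "2012--2013", {"price": [12.0, 13.0], "volume": [300.0, 400.0]})
prices = read_column(root, "2012--2013", "price")
```

Reading `price` opens only the `price` file, which is the reason a column store reduces I/O for queries that need a subset of columns.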

Restrictions

Castra is both fast and restrictive.

  • You must always give it dataframes that match its template (same column names, index type, dtypes).
  • You can only give Castra dataframes with increasing index values. For example, you can give it one dataframe a day for values on that day. You cannot go back and update previous days.
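
The two restrictions above amount to a check that runs before new rows are accepted. The following is a sketch of that idea in plain Python, not Castra's code: `check_extend` is a hypothetical helper, the template is modelled as a mapping from column name to Python type, and the exact rule (sorted index, no value earlier than the last stored one) is an assumption.

```python
from datetime import date


def check_extend(template, last_index, index, columns):
    """Raise ValueError unless the new rows fit the store's rules."""
    if list(columns) != list(template):
        raise ValueError("column names do not match the template")
    for name, expected_type in template.items():
        if not all(isinstance(v, expected_type) for v in columns[name]):
            raise ValueError(f"column {name!r} does not match the template dtype")
    # Index values must be sorted within the new rows...
    if any(b < a for a, b in zip(index, index[1:])):
        raise ValueError("index values must be increasing")
    # ...and must not reach back before what is already stored.
    if last_index is not None and index and index[0] < last_index:
        raise ValueError("new rows must not precede data already stored")


template = {"price": float, "volume": int}
last_stored = date(2011, 1, 1)

# Accepted: later dates, matching columns and types.
check_extend(template, last_stored,
             [date(2012, 1, 1), date(2013, 1, 1)],
             {"price": [12.0, 13.0], "volume": [300, 400]})

# Rejected: tries to add a row for a date that is already in the past.
try:
    check_extend(template, last_stored,
                 [date(2010, 6, 1)],
                 {"price": [9.5], "volume": [50]})
except ValueError as exc:
    rejection = str(exc)
```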

Text and Categoricals

Castra tries to encode text and object dtype columns with msgpack, using the implementation found in the Pandas library. It falls back to pickle with a high protocol if that fails.
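
The try-then-fall-back pattern described above can be sketched as follows. This is not Castra's code: `json` stands in for msgpack so the example needs only the standard library, and the one-byte tag plus the `encode_column` and `decode_column` helpers are hypothetical.

```python
import json
import pickle


def encode_column(values):
    """Try a compact, portable encoding first; fall back to pickle."""
    try:
        return b"J" + json.dumps(values).encode("utf-8")
    except (TypeError, ValueError):
        # Anything the first encoder cannot handle goes through pickle.
        return b"P" + pickle.dumps(values, protocol=pickle.HIGHEST_PROTOCOL)


def decode_column(blob):
    """Decode using whichever encoder the leading tag byte names."""
    tag, payload = blob[:1], blob[1:]
    if tag == b"J":
        return json.loads(payload.decode("utf-8"))
    return pickle.loads(payload)


text_blob = encode_column(["AAPL", "GOOG"])      # plain text: first encoder works
odd_blob = encode_column(["AAPL", {1, 2}])       # a set forces the pickle fallback
```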

Alternatively, Castra can categorize your data as it receives it:

>>> c = Castra('data.castra', template=df, categories=['list', 'of', 'columns'])

or

>>> c = Castra('data.castra', template=df, categories=True) # all object dtype columns

Categorizing columns that have repetitive text, like 'sex' or 'ticker-symbol', can greatly improve both read times and computational performance with Pandas. See this blog post for more information.
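
The effect of categorizing is visible in Pandas itself, independent of Castra. The sketch below uses made-up ticker data: it converts a repetitive text column to the `category` dtype and compares memory use. A categorical column stores each distinct string once and refers to it with small integer codes.

```python
import pandas as pd

# 100,000 rows drawn from only three distinct strings (made-up data).
tickers = pd.Series(["AAPL", "GOOG", "AAPL", "MSFT"] * 25_000)
as_category = tickers.astype("category")

# deep=True counts the string payloads, not just the pointers to them.
text_bytes = tickers.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

distinct = list(as_category.cat.categories)    # the strings, stored once
first_codes = list(as_category.cat.codes[:4])  # per-row integer codes
```

Comparing `text_bytes` with `category_bytes` shows how much smaller the categorical form is for data this repetitive.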

Dask dataframe

Castra interoperates smoothly with dask.dataframe:

>>> import dask.dataframe as dd
>>> df = dd.read_csv('myfiles.*.csv')
>>> df.set_index('timestamp', compute=False).to_castra('myfile.castra', categories=True)

>>> df = dd.from_castra('myfile.castra')

Work in Progress

Castra is immature and largely for experimental use.

The developers do not promise backwards compatibility with future versions. You should treat Castra as a very efficient temporary format and archive your data with some other system.