Transparent and persistent cache/serialization powered by type hints

What is Cachew?

TLDR: cachew lets you cache function calls into an sqlite database on your disk with a single decorator (similar to functools.lru_cache). The difference from functools.lru_cache is that the cached data is persisted between program runs, so the next time you call your function, it will only be a matter of reading from the cache. The cache is invalidated automatically if your function's arguments change, so you don't have to think about maintaining it.

In order to be cacheable, your function needs to return an Iterator (that is, a generator, tuple or list) of simple data types.

That allows cachew to automatically infer the schema from type hints (PEP 526), so you don't have to think about serializing/deserializing.
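For instance (a sketch with made-up field names), the annotations on a NamedTuple are available at runtime, which is what makes this kind of schema inference possible:

```python
from typing import NamedTuple, Optional, get_type_hints

class Measurement(NamedTuple):
    dt: str                      # primitives like str/int/float/bool map naturally onto sqlite columns
    temp: float
    note: Optional[str] = None   # Optional fields are simple too

# PEP 526 annotations can be introspected at runtime,
# so a library can derive a storage schema without any extra declarations
hints = get_type_hints(Measurement)
print(hints['temp'])  # <class 'float'>
```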

Motivation

I often find myself processing big chunks of data, merging data together, computing aggregates, or extracting the few bits I'm interested in. While I try to use the REPL as much as I can, some things are still fragile, and in the process of development you often just have to rerun the whole thing. This can be frustrating if data parsing and processing takes seconds, let alone minutes in some cases.

The conventional way of dealing with this is serializing results along with some sort of hash (e.g. md5) of the input files, comparing it on the next run, and returning the cached data if nothing changed.

Simple as it sounds, it is pretty tedious to do every time you need to memoize some data; it contaminates your code with routine and distracts you from your main task.

Examples

Processing Wikipedia

Imagine you're working on a data analysis pipeline for some huge dataset, say, extracting urls and their titles from a Wikipedia archive. Parsing it (the extract_links function) takes hours; however, as long as the archive is the same, you will always get the same results. So it would be nice to be able to cache the results somehow.

With this library, you can achieve that with a single @cachew decorator.

>>> from typing import NamedTuple, Iterator
>>> from cachew import cachew
>>> class Link(NamedTuple):
...     url: str
...     text: str
...
>>> @cachew
... def extract_links(archive_path: str) -> Iterator[Link]:
...     for i in range(5):
...         # simulate slow IO
...         # this function runs for five seconds for the purpose of demonstration, but realistically it might take hours
...         import time; time.sleep(1)
...         yield Link(url=f'http://link{i}.org', text=f'text {i}')
...
>>> list(extract_links(archive_path='wikipedia_20190830.zip')) # that would take about 5 seconds on first run
[Link(url='http://link0.org', text='text 0'), Link(url='http://link1.org', text='text 1'), Link(url='http://link2.org', text='text 2'), Link(url='http://link3.org', text='text 3'), Link(url='http://link4.org', text='text 4')]

>>> from timeit import Timer
>>> res = Timer(lambda: list(extract_links(archive_path='wikipedia_20190830.zip'))).timeit(number=1)
... # second run is cached, so should take less time
>>> print(f"call took {int(res)} seconds")
call took 0 seconds

>>> res = Timer(lambda: list(extract_links(archive_path='wikipedia_20200101.zip'))).timeit(number=1)
... # now file has changed, so the cache will be discarded
>>> print(f"call took {int(res)} seconds")
call took 5 seconds

When you call extract_links with the same archive, you start getting results in a matter of milliseconds, as fast as sqlite can read them.

When you use a newer archive, archive_path changes, which makes cachew invalidate the old cache and recompute the data, so you don't need to think about maintaining it separately.

Incremental data exports

This is my most common use case for cachew, which I'll illustrate with an example.

I'm using an environment sensor to log stats about temperature and humidity. The data is synchronized via bluetooth into an sqlite database, which is easy to access. However, the sensor has limited memory (e.g. the 1000 latest measurements). That means I end up with a new database every few days, each containing only a slice of the data I need, e.g.:

...
20190715100026.db
20190716100138.db
20190717101651.db
20190718100118.db
20190719100701.db
...

To access all of historic temperature data, I have two options:

  • Go through all the data chunks every time I want to access them and 'merge' them into a unified stream of measurements, e.g. something like:

    from pathlib import Path
    from typing import Iterator, List

    def measurements(chunks: List[Path]) -> Iterator[Measurement]:
        for chunk in chunks:
            ...  # read measurements from 'chunk' and yield unseen ones
    

    This is very easy, but slow: you waste CPU for no reason every time you need the data.

  • Keep a 'master' database and write code to merge the chunks into it.

    This is very efficient, but tedious:

    • requires serializing/deserializing data -- boilerplate
    • requires manually managing sqlite database -- error prone, hard to get right every time
    • requires careful scheduling; ideally you want to access new data without having to refresh the cache

Cachew gives you the best of both worlds and makes it both easy and efficient. The only thing you have to do is decorate your function:

@cachew
def measurements(chunks: List[Path]) -> Iterator[Measurement]:
    # ...

  • as long as chunks stay the same, the data stays the same, so you always read from the sqlite cache, which is very fast

  • you don't need to maintain the database; the cache is automatically refreshed when chunks change (i.e. you got new data)

    All the complexity of handling the database is hidden in the cachew implementation.

How it works

Basically, your data objects get flattened out, and Python types are mapped onto sqlite types and back.

When the function is called, cachew computes the hash of your function's arguments and compares it against the previously stored hash value.

  • If they match, cachew deserializes and yields whatever is stored in the cache database
  • If they don't match, the original function is called and the new data is stored along with the new hash
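In pseudocode (illustrative names, not cachew's actual internals), the logic is roughly:

```python
def cached_call(func, args, cache):
    new_hash = str(args)                # by default, the 'hash' is just the string representation of the arguments
    if cache.get('hash') == new_hash:
        yield from cache['rows']        # hash matches: replay rows from the cache
    else:
        rows = []
        for row in func(*args):         # hash mismatch: call the original function...
            rows.append(row)
            yield row                   # ...streaming results through as they come
        cache['hash'] = new_hash        # ...then store the new data along with the new hash
        cache['rows'] = rows
```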

Features

Performance

Updating the cache incurs some overhead, but how much depends on how complicated your datatype is in the first place, so I'd suggest measuring if you're not sure.

When reading from the cache, all that happens is reading rows from sqlite and mapping them onto your target datatype, so the only overhead is reading from sqlite, which is quite fast.

I haven't set up proper benchmarks/performance regression tests yet, so I don't want to make specific claims; however, it would almost certainly make your program faster if the computations take more than several seconds.

If you want to experiment for yourself, check out tests.test_many.

Using

See docstring for up-to-date documentation on parameters and return types. You can also use extensive unit tests as a reference.

Some useful (but optional) arguments of @cachew decorator:

  • cache_path can be a directory, or a callable that returns a path and depends on function's arguments.

    By default, settings.DEFAULT_CACHEW_DIR is used.

  • depends_on is a function which determines whether your inputs have changed and the cache needs to be invalidated.

    By default it just uses the string representation of the arguments; you can also specify a custom callable.

    For instance, it can be used to discard cache if the input file was modified.

  • cls is the type that would be serialized.

    By default, it is inferred from return type annotations, but can be specified if you don't control the code you want to cache.
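Putting these together, usage might look roughly like this (a sketch, not verified against a particular cachew version — check the docstring for the exact signatures; Link and the paths here are made up):

```python
from pathlib import Path
from typing import Iterator, NamedTuple

from cachew import cachew

class Link(NamedTuple):
    url: str
    text: str

@cachew(
    cache_path=lambda path: '/tmp/cachew-demo/' + Path(path).stem,  # cache location derived from the argument
    depends_on=lambda path: Path(path).stat().st_mtime,             # invalidate when the input file is modified
)
def extract_links(path: str) -> Iterator[Link]:
    ...
```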

Installing

The package is available on pypi:

pip3 install --user cachew

Developing

I'm using tox to run tests, and Github Actions for CI.

Implementation

  • why tuples and dataclasses?

    Tuples are natural in Python for quickly grouping together return results. NamedTuple and dataclass specifically provide a very straightforward and self-documenting way to represent data in Python. The very compact syntax makes them extremely convenient even as a one-off means of communicating between a couple of functions.

    If you want to find out more about why you should use more dataclasses in your code, I suggest these links:

  • why not pickle?

    Pickling is a bit heavyweight for plain data classes. There are many reports of pickle being slower than even JSON, and it's also a security risk. Lastly, it can only be loaded via Python, whereas sqlite has numerous bindings and tools to explore and interface with it.

  • why sqlite database for storage?

    It's pretty efficient, and a sequence of namedtuples maps onto database rows in a very straightforward manner.
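    To illustrate the point (plain sqlite3, not cachew's actual code): a namedtuple is already row-shaped, so it can go into and out of a table with almost no glue:

```python
import sqlite3
from typing import NamedTuple

class Link(NamedTuple):
    url: str
    text: str

links = [Link('http://link0.org', 'text 0'), Link('http://link1.org', 'text 1')]

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cache (url TEXT, text TEXT)')    # one column per field
conn.executemany('INSERT INTO cache VALUES (?, ?)', links)  # namedtuples already unpack as rows
restored = [Link(*row) for row in conn.execute('SELECT * FROM cache')]
assert restored == links
```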

  • why not pandas.DataFrame?

    DataFrames are great and can be serialized to csv or pickled. They are good to have as one of the ways to interface with your data; however, they are hardly convenient to reason about abstractly due to their dynamic nature. They also can't be nested.

  • why not ORM?

    ORMs tend to be pretty invasive, which might complicate your scripts or even ruin performance. They are also somewhat overkill for such a specific purpose.

    • E.g. SQLAlchemy requires using custom SQLAlchemy-specific types and inheriting a base class. It also doesn't support nested types.

  • why not marshmallow?

    Marshmallow is a common way to map data into a db-friendly format, but it requires an explicit schema, which is an overhead when you already have it in the form of type annotations. I've looked at existing projects that utilize type annotations, but didn't find them covering all I wanted:

Tips and tricks

Optional dependency

You can benefit from cachew even if you don't want to bloat your app's dependencies. Just use the following snippet:

def mcachew(*args, **kwargs):
    """
    Stands for 'Maybe cachew'.
    Defensive wrapper around @cachew to make it an optional dependency.
    """
    try:
        import cachew
    except ModuleNotFoundError:
        import warnings
        warnings.warn('cachew library not found. You might want to install it to speed things up. See https://github.com/karlicoss/cachew')
        return lambda orig_func: orig_func
    else:
        return cachew.cachew(*args, **kwargs)

Now you can use @mcachew in place of @cachew, and be certain things don't break if cachew is missing.

Settings

cachew.settings exposes some parameters that allow you to control cachew's behaviour:

  • ENABLE: set to False if you want to disable caching without removing the decorators (useful for testing and debugging). You can also use the cachew.extra.disabled_cachew context manager to do it temporarily.
  • DEFAULT_CACHEW_DIR: override to set a different base directory. The default is the "user cache directory" (see appdirs docs).
  • THROW_ON_ERROR: by default, cachew is defensive and simply attempts to call the original function if it runs into caching issues. Set to True to raise instead, so you catch errors earlier.

Updating this readme

This is a literate readme, implemented as a Jupyter notebook: README.ipynb. To update the (autogenerated) README.md, use the generate-readme script.
