• Stars
    star
    143
  • Rank 257,007 (Top 6 %)
  • Language
    C
  • License
    Other
  • Created over 7 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Diskbased (persistent) hashtable

Disk-based hashtable

Build & test (Haskell) Build & test (Python) License: MIT

A simple disk-based hash table (i.e., persistent hash table).

It is a hashtable implemented on memory-mapped disk, so that it can be loaded with a single mmap() system call and used in memory directly (being as fast as an in-memory hashtable once it is loaded from disk).

The code is in C, wrappers are provided for Python, Haskell, and C++. The wrappers follow similar APIs with variations to accommodate the language specificity. They all use the same underlying code, so you can open a hashtable created in C from Haskell, modify it within your Haskell code, and later open the result in Python.

Cross-language functionality will only work for simple types where you can control their binary representation (64-bit integers, for example).

Reading does not touch the disk representation at all and, thus, can be done on top of read-only files or using multiple threads (and different processes will share the memory: the operating system does that for you). Writing or modifying values is, however, not thread-safe.

Examples

The following examples all create a hashtable to store longs (int64_t), then set the value associated with the key "key" to 9. In the current API, the maximum size of the keys needs to be pre-specified, which is the value 15 below.

Raw C

#include <stdio.h>
#include <inttypes.h>
#include "diskhash.h"

int main(void) {
    HashTableOpts opts;
    opts.key_maxlen = 15;
    opts.object_datalen = sizeof(int64_t);
    char* err = NULL;
    HashTable* ht = dht_open("testing.dht", opts, O_RDWR|O_CREAT, &err);
    if (!ht) {
        if (!err) err = "Unknown error";
        fprintf(stderr, "Failed opening hash table: %s.\n", err);
        return 1;
    }
    long i = 9;
    dht_insert(ht, "key", &i);
    
    long* val = (long*) dht_lookup(ht, "key");
    printf("Looked up value: %l\n", *val);

    dht_free(ht);
    return 0;
}

The C API relies on error codes and error strings (the &err argument above). The header file has decent documentation.

Haskell

In Haskell, you have different types/functions for read-write and read-only hashtables. Read-write operations are IO operations, read-only hashtables are pure.

Read write example:

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRW "testing.dht" 15
    htInsertRW ht "key" (9 :: Int64)
    val <- htLookupRW "key" ht
    print val

Read only example (htLookupRO is pure in this case):

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRO "testing.dht" 15
    let val :: Int64
        val = htLookupRO "key" ht
    print val

Python

Python's interface is based on the struct module. For example, 'll' refers to a pair of 64-bit ints (longs):

import diskhash

tb = diskhash.StructHash(
    fname="testing.dht", 
    keysize=15, 
    structformat='ll',  # store pairs of longs
    mode='rw',
)
value = [1, 2]  # pair of longs
tb.insert("key", *value)
print(tb.lookup("key"))

The Python interface is currently Python 3 only. Patches to extend it to 2.7 are welcome, but it's not a priority.

C++

In C++, a simple wrapper is defined, which provides a modicum of type-safety. You use the DiskHash<T> template. Additionally, errors are reported through exceptions (both std::bad_alloc and std::runtime_error can be thrown) and not return codes.

#include <iostream>
#include <string>

#include <diskhash.hpp>

int main() {
    const int key_maxlen = 15;
    dht::DiskHash<uint64_t> ht("testing.dht", key_maxlen, dht::DHOpenRW);
    std::string line;
    uint64_t ix = 0;
    while (std::getline(std::cine, line)) {
        if (line.length() > key_maxlen) {
            std::cerr << "Key too long: '" << line << "'. Aborting.\n";
            return 2;
        }
        const bool inserted = ht.insert(line.c_str(), ix);
        if (!inserted) {
            std::cerr  << "Found repeated key '" << line << "' (ignored).\n";
        }
        ++ix;
    }
    return 0;
}

Stability

This is beta software. It is good enough that I am using it, but the API can change in the future with little warning. The binary format is versioned (the magic string encodes its version, so changes can be detected and you will get an error message in the future rather than some silent misbehaviour.

Automated unit testing ensures that basic mistakes will not go uncaught.

Limitations

  • You must specify the maximum key size. This can be worked around either by pre-hashing the keys (with a strong hash) or using multiple hash tables for different key sizes. Neither is currently implemented in diskhash.

  • You cannot delete objects. This was not a necessity for my uses, so it was not implemented. A simple implementation could be done by marking objects as "deleted" in place and recompacting when the hash table size changes or with an explicit dht_gc() call. It may also be important to add functionality to shrink hashtables so as to not waste disk space.

  • The algorithm is a rather naïve implementation of linear addression. It would not be hard to switch to robin hood hashing and this may indeed happen in the near future.

License: MIT

More Repositories

1

BuildingMachineLearningSystemsWithPython

Source Code for the book Building Machine Learning Systems with Python
Python
2,109
star
2

mahotas

Computer Vision in Python
Python
811
star
3

milk

MILK: Machine Learning Toolkit
Python
610
star
4

jug

Parallel programming with Python
Python
397
star
5

django-gitcms

A git based cms for django
Python
71
star
6

imread

Read images to numpy arrays
C++
69
star
7

nixml

NIX + YAML for easy to use reproducible environments
Python
60
star
8

pymorph

Python Morphology Toolbox
Python
47
star
9

talk-python-intro

Introduction to Python (Jupyter based)
Jupyter Notebook
30
star
10

hex

Reimplementation of TeX in Haskell: pre-alpha
Haskell
30
star
11

python-image-tutorial

Python image tutorial (based on ipython notebooks)
Jupyter Notebook
29
star
12

milksets

Machine Learning Toolkit Datasets: A collection of UCI datasets with a Python interface
Python
26
star
13

Programming-for-Scientists

Source Material for a course on Programming targeted at scientists
TeX
25
star
14

pythonvision_org

django-gitcms files for http://pythonvision.org
CSS
17
star
15

mergedirs

Merge two directories without losing files
Python
16
star
16

luispedro_org

jekyll files for luispedro.org
JavaScript
12
star
17

Coelho2009_ISBI_NuclearSegmentation

Reproducible research archive for "Nuclear segmentation in microscope cell images"
TeX
9
star
18

Coelho2021_GMGCv1

Python
9
star
19

libertarian-welfare

Libertarian Welfare State: A Book I'm Writing
Python
7
star
20

dot-link

Dotted Suffix Trees
C++
7
star
21

HBC

My Version of Hal Daumé's Hierarchical Bayesian Compiler
Haskell
7
star
22

pyslic

PySLIC
Python
5
star
23

android-fuse

Mount an android device using FUSE
Python
5
star
24

irstlm

Fork of IRSTLM
C++
5
star
25

safeio

Haskell Library for safe (atomic) IO
Haskell
3
star
26

PenalizedRegression

Shell
3
star
27

vita

My Vita
TeX
3
star
28

gitpointer

Github suggest
Python
3
star
29

elgreco

Graphical Models
C++
3
star
30

mahotas-paper

Paper about mahotas
Python
3
star
31

base-user

dot files and the like so that I can set up a new computer with a couple of command line calls
TeX
3
star
32

readmagick

Read images in Python using ImageMagick++ : SUPERCEDED BY IMREAD
C++
3
star
33

Coelho2015_NetsDetermination

Reproducible code archive for "Automatic Determination of NET (Neutrophil Extracellular Traps) Coverage in Fluorescent Microscopy Images" by Coelho et al.
Python
3
star
34

website.content

Content for My Website
2
star
35

talk-scientific-communication

Talk on Scientific Oral Communication
HTML
2
star
36

safeout

Simple atomic writing for Python
Python
2
star
37

conduit-algorithms

Conduit based algorithms
Haskell
2
star
38

unpack

Unpack zip/7z/tar.* archives with a consistent interface
Python
2
star
39

imcol

Image Collection Management
Python
2
star
40

programming

Programming: An Introduction. A introductory book on computer programming.
2
star
41

waldo

Waldo Project
Python
2
star
42

gitwc

Plot number of chars, words, and lines across time in a repository
Python
2
star
43

beiraproject

Beira Project Stuff
Shell
2
star
44

jug-presentations

presentations about jug
TeX
2
star
45

Hanu

Utilities for numerics in Haskell
Haskell
2
star
46

jug-paper

A future paper about jug (written in the open)
TeX
1
star
47

fna2faa.rs

Rust
1
star
48

outsort

Generic sorting of large datasets (using temporary files as temporary space)
Haskell
1
star
49

imm-mirna

Shell
1
star
50

particles

Python
1
star
51

StructureFunctionOceanTutorial

Jupyter Notebook
1
star
52

ML-for-microbial-communities

Jupyter Notebook
1
star
53

refsweb

Web interface to refs
Python
1
star
54

rbit

rabbit mail
Python
1
star
55

tutorial-unit-testing

HTML
1
star
56

alist

Append List in Haskell
Haskell
1
star
57

whim

What Have I Missed
CoffeeScript
1
star
58

refs

Bibtex Reference Management Software
Python
1
star
59

vision

Vision and Image Processing Library for Python [superceded by mahotas—use that]
Python
1
star
60

tutorial-cluster-usage

JavaScript
1
star
61

mlsegment

C++
1
star
62

cq

Code Quarterly
Haskell
1
star
63

rabbit_blog

1
star
64

beiraproject_org

django-gitcms repo for beiraproject.org
Python
1
star
65

TestingNGLESS

Haskell
1
star
66

NGH

Next Generation Sequence Handling in Haskell
Haskell
1
star
67

redundant100

Haskell
1
star
68

image-difftool

Image diff for git on the command line
Python
1
star
69

binary-instances

Haskell instances of Data.Binary
Haskell
1
star
70

anscombe

Python
1
star
71

fautils

Haskell
1
star
72

fragile

Lightweight command line unit testing tool for nodejs (with coffescript support)
JavaScript
1
star
73

metarabbit

Posts for metarabbit.wordpress.com
Python
1
star
74

blog_luispedro_org

CSS
1
star