• Stars
    star
    143
  • Rank 257,007 (Top 6 %)
  • Language
    Java
  • License
    Apache License 2.0
  • Created over 13 years ago
  • Updated over 7 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Java implementation of a probabilistic set data structure

greplin-bloom-filter

Greplin Bloom Filter

An Bloom Filter implementation in Java, that optionally supports persistence and counting buckets.

Status:

This is a very early stage project. It works for our needs. We haven't verified it works beyond that. Issue reports and patches are very much appreciated!

Some improvements we'd love to see include:

  • Optimized code paths for particular sized counting buckets

  • More hash functions to choose between

  • Variable cache sizes and types. Enabling, for example, a lower-memory read only mode or a smaller cache that performs disk seeks to perform some operations

  • Support on-the fly bucket expansion, possibly via a [d-left counting bloom filter] (http://theory.stanford.edu/~rinap/papers/esa2006b.pdf)

Pre-requisites:

[Maven] (http://maven.apache.org/)

Installation

git clone https://github.com/Greplin/greplin-bloom-filter.git

cd greplin-bloom-filter

mvn install

Implementation details

  • The NewBuilder and OpenBuilder expose a lot of (optionally) tunable knobs and hooks. Most users will never need to know about them though, and just supplying the required Builder constructor arguments will be fine.

  • This is a counting bloom filter that uses a configurable number of bits per bucket. If you use one bit per bucket, then items can never be deleted. If you use 8 bits per bucket, then it uses 8x more space than a non-counting filter, but items can be deleted as long as the count doesn't exceed 255 items in a bucket.

  • Instead of using N distinct hashes, we use linear combinations of two runs of a repeated murmur hash per [Kirch and Mitzenmacher] (https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf).

  • If you are using a persistent bloom filter, then on flush we intelligently decide to either rewrite the entire file or or just seek into the file and change particular bytes based on how much of the file has changed.

  • Has relatively efficient thread-safety via a ReentrantReadWriteLock

  • No external dependencies (besides JUnit - which is only needed to run the test suite)

Usage

// if the 'file' is null, the bloom filter is in-memory only, and not-persisted to disk
final File onDiskFile = new File("/tmp/greplin-bloom-filter.bin");
final int expectedItems = 10000;
final double desiredFalsePositiveRate = 0.000001;

final byte[] exampleItemA = "Hello World!".getBytes(Charset.forName("UTF-8"));
final byte[] exampleItemB = "Goodbye Cruel world".getBytes(Charset.forName("UTF-8"));

BloomFilter bloomFilter = new BloomFilter.NewBuilder(onDiskFile, expectedItems, desiredFalsePositiveRate)
    .force(true) // tells it to over-write any existing file at onDiskFile
    .build();

System.out.println(bloomFilter.contains(exampleItemA)); // false
System.out.println(bloomFilter.contains(exampleItemB)); // false

bloomFilter.add(exampleItemA);
bloomFilter.add(exampleItemB);

System.out.println(bloomFilter.contains(exampleItemA)); // true
System.out.println(bloomFilter.contains(exampleItemB)); // true

bloomFilter.remove(exampleItemB);

System.out.println(bloomFilter.contains(exampleItemA)); // true
System.out.println(bloomFilter.contains(exampleItemB)); // false

bloomFilter.close();
bloomFilter = null;


// now, let's reopen the same bloom filter from the on-disk file
bloomFilter = new BloomFilter.OpenBuilder(onDiskFile).build();

System.out.println(bloomFilter.contains(exampleItemA)); // true
System.out.println(bloomFilter.contains(exampleItemB)); // false

bloomFilter.remove(exampleItemA);

System.out.println(bloomFilter.contains(exampleItemA)); // false
System.out.println(bloomFilter.contains(exampleItemB)); // false

bloomFilter.close();

Authors

Greplin, Inc.

More Repositories

1

scales

scales - Metrics for Python
Python
920
star
2

hookshot

Instrumentation for Objective C for debugging and profiling
Objective-C
392
star
3

ocstyle

Objective-C style checker
Python
255
star
4

fast-python-pb

Fast Protocol Buffers in python (by using the C++ API)
Python
250
star
5

TheKitchenSync

A Tool Belt for iOS Concurrency
Objective-C
197
star
6

hop

Script to hop to common directories and servers
Lua
112
star
7

greplin-lucene-utils

Some utilities for Lucene
Java
111
star
8

CueTableReloader

A really handy class that automatically figures out insertions, deletions, and reloads in UITableView based on unique item keys.
Objective-C
92
star
9

greplin-exception-catcher

Exception catcher that runs on Google App Engine
Python
74
star
10

greplin-tornado-ses

An asynchronous client for Amazon SES
Python
42
star
11

greplin-zookeeper-utils

Utilities for dealing with Apache Zookeeper
Java
41
star
12

greplin-nagios-utils

Utilities for monitoring with Nagios
Python
39
star
13

lucene-interval-fields

Lucene fields and queries for interval fields.
Java
37
star
14

greplin-tornado-sendgrid

A client for the Sendgrid API
Python
32
star
15

greplin-twisted-utils

Utilities for working with Twisted
Python
28
star
16

qc

QuickCheck for Python
Python
27
star
17

greplin-tornado-stripe

Tornado bindings for Stripe's API
Python
27
star
18

polarbear

OOM diagnostics for Java.
C++
21
star
19

greplin-tornado-mixpanel

A client for the Mixpanel API
Python
20
star
20

greplin-tornado-kissmetrics

A client for the Kissmetrics API
Python
17
star
21

evernote-python-api

Packaged version of latest Evernote Python API
Python
14
star
22

htmltotext

Fork of flaxcode htmltotext module
C++
13
star
23

jsnappy

Java implementation of the Snappy compression/decompression algorithm from Google.
Java
11
star
24

phpserialize

PHP style serialize and unserialize in Python
Python
6
star
25

hegemon

(java)script utilities
Java
5
star
26

skia

Fork of Google's Skia library
C++
4
star
27

pyiso8601

Forked version of pyiso8601 - http://code.google.com/p/pyiso8601
Python
3
star
28

eventlet

Fork of eventlet with patches from Greplin
Python
3
star
29

greplin-vobject

Greplin fork of vobject
Python
2
star
30

hegemon-example

Java
1
star