• Stars
    star
    286
  • Rank 143,808 (Top 3 %)
  • Language
    Python
  • License
    MIT License
  • Created over 12 years ago
  • Updated over 5 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Fast Redis Bloom Filters in Python

Python + Redis + Bloom Filter = pyreBloom

Build Status Status: Production Team: Big Data Scope: External Open Source: MIT Critical: Yes

One of Salvatore's suggestions for Redis' GETBIT and SETBIT commands is to implement bloom filters. There was an existing python project that we used for inspiration.

Notice

Important -- The most recent version uses different seed values from all previous releases. Previous releases were using srand and rand, though they are not guaranteed to yield the same values on different systems. For example, two clients compiled on different platforms with different C implementations may not necessarily agree on what's in the filters. This latest version fixes this, but will also be incompatible with filters constructed with any previous versions.

Installation

You will need hiredis installed, and a C compiler (probably GCC). You can optionally have Cython installed, which will generate the C extension code. With those things installed, it's pretty simple:

pip install -r requirements.txt
python setup.py install

Hiredis

In order to install hiredis, you can:

# From https://github.com/paulasmuth/recommendify/issues/6#issuecomment-4496616
# via https://github.com/seomoz/pyreBloom/issues/7#issuecomment-21182063
#
# On Mac:
brew install hiredis

# With Ubuntu:
apt-get install libhiredis-dev

# From source:
git clone https://github.com/redis/hiredis
cd hiredis && make && sudo make install

Usage

There are serial and batch forms for both add and contains. The batch modes are about 4-5 times faster than their serial equivalents, so use them when you can. When you instantiate a pyreBloom, you should give it a redis key name, a capacity, and an error rate:

import pyreBloom
p = pyreBloom.pyreBloom('myBloomFilter', 100000, 0.01)
# You can find out how many bits this will theoretically consume
p.bits
# And how many hashes are needed to satisfy the false positive rate
p.hashes

From that point, you can add elements quite easily:

tests = ['hello', 'how', 'are', 'you', 'today']
p.extend(tests)

The batch mode of contains differs from the serial version in that it actually returns which elements are in the bloom filter:

p.contains('hello')
# True
p.contains(['hello', 'whats', 'new', 'with', 'you'])
# ['hello', 'you']
'hello' in p
# True

The Story

We needed to keep track of sets of urls that we had seen when crawling web pages, and had previously been keeping track of them in redis sets. Redis sets are, after all, extremely fast. As you can see in the benchmarks, set insertions can handle about 500k 20-character insertions per second. That is performant.

However, these sets of urls got to be prohibitively large. But, since we didn't really need to know which urls we had seen but merely whether or not we had seen a given url, we started inserting hashes of urls into redis sets. Unfortunately, even these got to be prohibitively large. We tried a lot of things, including limiting the number of discovered urls, but we also thought about using bloom filters.

There was an existing library to use redis strings as bloom filters, but it wasn't inserting elements fast enough for our liking. By implementing our hash functions in pure C we were able to double our performance. Using the C bindings for redis (hiredis), we were able to squeeze another 5x performance boost, for a total of about 10x over the original implementation.

Rough Bench

Here are numbers from the benchmark script run on a 2011-ish MacBook Pro and Redis 2.4.0, inserting 10k 20-character psuedo-random words:

Generating 20000 random test words
Generated random test words in 0.365890s
Filter using 4 hash functions and 95850 bits
Batch insert : 0.209492s (47734.526951 words / second)
Serial insert: 0.770047s (12986.217154 words / second)
Batch test   : 0.170484s (58656.590137 words / second)
Serial test  : 0.728285s (13730.886920 words / second)
False positive rate: 0.012300 (0.100000 expected)
Redis set add  : 0.023647s (422885.373502 words / second)
Redis pipe chk : 0.244068s (40972.163611 words / second)
Redis pipe sadd: 0.241150s (41467.941791 words / second)
Redis pipe chk : 0.240877s (41514.979051 words / second)

While set insertions are much faster than our bloom filter insertions (this is mostly do to the fact that there's not a 'SETMBIT' command), the pipelined versions of 'sadd' and checking for membership in the set are actually a little slower than the bloom filter implementation. Win some, lose some.

More Repositories

1

shovel

Rake, for Python
Python
664
star
2

simhash-py

Simhash and near-duplicate detection
Python
377
star
3

qless

Queue / Pipeline Management
Ruby
292
star
4

interpol

A toolkit for working with API endpoint definition files, giving you a stub app, a schema validation middleware, and browsable documentation.
HTML
187
star
5

word2gauss

Gaussian word embeddings
Python
186
star
6

reppy

Modern robots.txt Parser for Python
Python
178
star
7

SEOmozAPISamples

Mozscape API sample code
Java
158
star
8

simhash-cpp

Simhashing in C++
C++
121
star
9

url-py

URL Transformation, Sanitization
Python
102
star
10

qless-core

Core Lua Scripts for qless
Python
83
star
11

simhash-db-py

Python API for Various DB-Backed Simhash Clusters
Python
63
star
12

qless-py

Python Bindings for qless
Python
48
star
13

qdr

Query-Document Relevance
Python
43
star
14

dragnet_data

Training/test data for Dragnet
Shell
41
star
15

publicsuffix-elixir

Elixir library providing public suffix logic based on publicsuffix.org data
Elixir
38
star
16

linkscape-gem

Provides an interface to SEOmoz's suite of APIs, including the free and site intelligence APIs.
Ruby
38
star
17

simhash-cluster

A cluster implementation of simhash near-duplicate detection
Python
33
star
18

Social-Authority-SDK

Ruby
33
star
19

s3po

Your Friendly Asynchronous S3 Upload Protocol Droid
Python
30
star
20

GWT-keyword-analysis

Analysis of Google Webmaster Tools search data
Python
25
star
21

g-crawl-py

Gevent Crawling in Python, with Utilities
Python
23
star
22

mozsci

Data science tools from Moz
Python
22
star
23

url-cpp

C++ bindings for url parsing and sanitization
C++
19
star
24

vocab

Vocabulary using n-grams
Python
16
star
25

uri_parser

A fast URI parser that wraps Google's chromium URL canonicalization library
C++
13
star
26

downpour

Fetch urls quickly and asynchronously with Twisted, honoring politeness.
Python
13
star
27

rep-cpp

Robot exclusion protocol in C++
C++
12
star
28

mltk

mltk - Moz Language Tool Kit
Python
12
star
29

plines

Easily create job pipelines out of declared job dependencies using Qless.
Ruby
10
star
30

awssh

AWSSH Config
Python
9
star
31

roger-mesos

A complete mesos cluster setup with automatic load balancing
Python
8
star
32

linkscape-py

Python Bindings for Linkscape's API
Python
5
star
33

qless-js

Node.js bindings for qless
JavaScript
5
star
34

roger-bamboo

Roger's internal load balancer and frontend proxy. Based on https://github.com/QubitProducts/bamboo
Go
5
star
35

gzippy

Gzip files in python
Python
4
star
36

asis

Lightweight As-Is Server
Python
4
star
37

awscpp

AWS C++ Bindings
C++
3
star
38

rack-authenticate

Rack middleware that handles basic auth and HMAC auth
Ruby
3
star
39

elasticsearch-utils

Some elasticsearch utilities I've put together / been using in investigating elasticsearch performance
Python
3
star
40

pyjudy

Python bindings to libJudy
Python
3
star
41

resque-unfairly

A Resque plugin for processing queues from random jobs based on queue weightings. Inspired by resque-fairly.
Ruby
3
star
42

roger-monitoring

Monitoring stack for RogerOS
Python
3
star
43

crawl-curio-cabinet

A Curio Cabinet of the Odd Behaviors We've Seen on the Internet
HTML
3
star
44

qless-docker

Create a qless docker image!
Ruby
2
star
45

irobot

robots.txt file inspection
Ruby
2
star
46

bloomfilter-py

Simple and fast Bloom filter
Python
2
star
47

docker-sortdb

Docker setup for SortDB
Shell
1
star
48

qless-java

qless java binding
Java
1
star
49

zendesk-search

Search for tags and such in zendesk
JavaScript
1
star
50

deb-swift

1
star
51

fiji

Cell schemas and schema versioning for HBase
HTML
1
star
52

p5-Webservice-Followerwonk-SocialAuthority

Perl Client for The Followerwonk Social Authority API
Perl
1
star
53

qless-util-py

Utilities for use with qless-py
Python
1
star
54

process_tree_dictionary

Implements a dictionary that is scoped to a process tree for Erlang and Elixir.
Elixir
1
star
55

moz_nav

DEPRECATED. Common navigation and layout across all SEOmoz applications
Ruby
1
star
56

logtools

Stuff for reading crawler log files. Probably not of much interest to those outside of SeoMOZ.
Python
1
star