• Stars
    star
    377
  • Rank 113,535 (Top 3 %)
  • Language
    Python
  • License
    MIT License
  • Created over 12 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Simhash and near-duplicate detection

Simhash Near-Duplicate Detection

Build Status

Status: Production Team: Big Data Scope: External Open Source: MIT Critical: Yes

This library enables the efficient identification of near-duplicate documents using simhash using a C++ extension.

Usage

simhash differs from most hashes in that its goal is to have two similar documents produce similar hashes, where most hashes have the goal of producing very different hashes even in the face of small changes to the input.

The input to simhash is a list of hashes representative of a document. The output is an unsigned 64-bit integer. The input list of hashes can be produced in several ways, but one common mechanism is to:

  1. tokenize the document
  2. consider overlapping shingles of these tokens (simhash.shingle)
  3. hash these overlapping shingles
  4. input these hashes into simhash.compute

This has the effect of considering phrases in a document, rather than just a bag of the words in it.

Once we've produced a simhash, we would like to compare it to other documents. For two documents to be considered near-duplicates, they must have few bits that differ. We can compare two documents:

import simhash

a = simhash.compute(...)
b = simhash.compute(...)
simhash.num_differing_bits(a, b)

One of the key advantages of simhash is that it does not require O(n^2) time to find all near-duplicate pairs from a set of hashes. Given a whole set of simhashes, we can find all pairs efficiently:

import simhash

# The `simhash`-es from our documents
hashes = []

# Number of blocks to use (more in the next section)
blocks = 4
# Number of bits that may differ in matching pairs
distance = 3
matches = simhash.find_all(hashes, blocks, distance)

All the matches returned are guaranteed to be all pairs where the hashes differ by distance bits or fewer. The blocks parameter is less intuitive, but is best described in this article or in the paper. The best parameter to choose depends on the distribution of the input simhashes, but it must always be at least one greater than the provided distance.

Internally, find_all takes blocks C distance passes to complete. The idea is that as that value increases (for instance by increasing blocks), each pass completes faster. In terms of memory, find_all takes O(hashes + matches) memory.

Building

This is installable via pip:

pip install git+https://github.com/seomoz/simhash-py.git

It can also be built from git:

git submodule update --init --recursive
python setup.py install

or

pip install simhash-py

under osx, you should

export MACOSX_DEPLOYMENT_TARGET = 10.x (10.9,10.10...)

first

Benchmark

This is a rough benchmark, but should help to give you an idea of the order of magnitude for the performance available. Running on a single core on a vagrant instance on a 2015 MacBook Pro:

$ ./bench.py --random 1000000 --blocks 5 --bits 3
Generating 1000000 hashes
Starting Find all
     Ran Find all in 1.595416s

Architecture

Each document gets associated with a 64-bit hash calculated using a rolling hash function and simhash. This hash can be thought of as a fingerprint for the content. Two documents are considered near-duplicates if their hashes differ by at most k bits, a parameter chosen by the user.

In this context, there is a large corpus of known fingerprints, and we would like to determine all the fingerprints that differ by our query by k or fewer bits. To accomplish this, we divide up the 64 bits into at m blocks, where m is greater than k. If hashes A and B differ by at most k bits, then at least m - k groups are the same.

Choosing all the unique combinations of m - k blocks, we perform a permutation on each of the hashes for the documents so that those blocks are first in the hash. Perhaps a picture would illustrate it better:

63------53|52------42|41-----32|31------21|20------10|09------0|
|    A    |     B    |    C    |     D    |     E    |    F    |

If m = 6, k = 3, we'll choose permutations:
- A B C D E F
- A B D C E F
- A B E C D F
...
- C D F A B E
- C E F A B D
- D E F A B C

This generates a number of tables that can be put into sorted order, and then a small range of candidates can be found in each of those tables for a query, and then each candidate in that range can be compared to our query.

The corpus is represented by the union of these tables, could conceivably be hosted on a separate machine. And each of these tables is also amenable to sharding, where each shard would comprise a contiguous range of numbers. For example, you might divide a table into 256 shards, where each shard is associated with each of the possible first bytes.

The best partitioning remains to be seen, likely from experimentation, but the basis of this is the table. The table tracks hashes inserted into it subject to a permutation associated with the table. This permutation is described as a vector of bitmasks of contiguous bit ranges, whose populations sum to 64.

Example

Let's suppose that our corpus has a fingerprint:

0100101110111011001000101111101110111100001010011101100110110101

and we have a query:

0100101110111011011000101111101110011100001010011100100110110101

and they differ by only three bits which happen to fall in blocks B, D and E:

63------53|52------42|41-----32|31------21|20------10|09------0|
|    A    |     B    |    C    |     D    |     E    |    F    |
|         |          |         |          |          |         |
0000000000000000010000000000000000100000000000000001000000000000

Since any fingerprint matching the query differs by at most 3 bits, at most 3 blocks can differ, and at least 3 must match. Whatever table has the 3 blocks that do not differ as the leading blocks will match the query when doing a scan. In this case, the table that's permuted A C F B D E will match. It's important to note that it's possible for a query to match from more than one table. For example, if two of the non-matching bits are in the same block, or the query differs by fewer than 3 bits.

32-Bit Systems

The only requirement of simhash-py is that it has uint64_t.

More Repositories

1

shovel

Rake, for Python
Python
664
star
2

qless

Queue / Pipeline Management
Ruby
292
star
3

pyreBloom

Fast Redis Bloom Filters in Python
Python
286
star
4

interpol

A toolkit for working with API endpoint definition files, giving you a stub app, a schema validation middleware, and browsable documentation.
HTML
187
star
5

word2gauss

Gaussian word embeddings
Python
186
star
6

reppy

Modern robots.txt Parser for Python
Python
178
star
7

SEOmozAPISamples

Mozscape API sample code
Java
158
star
8

simhash-cpp

Simhashing in C++
C++
121
star
9

url-py

URL Transformation, Sanitization
Python
102
star
10

qless-core

Core Lua Scripts for qless
Python
83
star
11

simhash-db-py

Python API for Various DB-Backed Simhash Clusters
Python
63
star
12

qless-py

Python Bindings for qless
Python
48
star
13

qdr

Query-Document Relevance
Python
43
star
14

dragnet_data

Training/test data for Dragnet
Shell
41
star
15

publicsuffix-elixir

Elixir library providing public suffix logic based on publicsuffix.org data
Elixir
38
star
16

linkscape-gem

Provides an interface to SEOmoz's suite of APIs, including the free and site intelligence APIs.
Ruby
38
star
17

simhash-cluster

A cluster implementation of simhash near-duplicate detection
Python
33
star
18

Social-Authority-SDK

Ruby
33
star
19

s3po

Your Friendly Asynchronous S3 Upload Protocol Droid
Python
30
star
20

GWT-keyword-analysis

Analysis of Google Webmaster Tools search data
Python
25
star
21

g-crawl-py

Gevent Crawling in Python, with Utilities
Python
23
star
22

mozsci

Data science tools from Moz
Python
22
star
23

url-cpp

C++ bindings for url parsing and sanitization
C++
19
star
24

vocab

Vocabulary using n-grams
Python
16
star
25

uri_parser

A fast URI parser that wraps Google's chromium URL canonicalization library
C++
13
star
26

downpour

Fetch urls quickly and asynchronously with Twisted, honoring politeness.
Python
13
star
27

rep-cpp

Robot exclusion protocol in C++
C++
12
star
28

mltk

mltk - Moz Language Tool Kit
Python
12
star
29

plines

Easily create job pipelines out of declared job dependencies using Qless.
Ruby
10
star
30

awssh

AWSSH Config
Python
9
star
31

roger-mesos

A complete mesos cluster setup with automatic load balancing
Python
8
star
32

linkscape-py

Python Bindings for Linkscape's API
Python
5
star
33

qless-js

Node.js bindings for qless
JavaScript
5
star
34

roger-bamboo

Roger's internal load balancer and frontend proxy. Based on https://github.com/QubitProducts/bamboo
Go
5
star
35

gzippy

Gzip files in python
Python
4
star
36

asis

Lightweight As-Is Server
Python
4
star
37

awscpp

AWS C++ Bindings
C++
3
star
38

rack-authenticate

Rack middleware that handles basic auth and HMAC auth
Ruby
3
star
39

elasticsearch-utils

Some elasticsearch utilities I've put together / been using in investigating elasticsearch performance
Python
3
star
40

pyjudy

Python bindings to libJudy
Python
3
star
41

resque-unfairly

A Resque plugin for processing queues from random jobs based on queue weightings. Inspired by resque-fairly.
Ruby
3
star
42

roger-monitoring

Monitoring stack for RogerOS
Python
3
star
43

crawl-curio-cabinet

A Curio Cabinet of the Odd Behaviors We've Seen on the Internet
HTML
3
star
44

qless-docker

Create a qless docker image!
Ruby
2
star
45

irobot

robots.txt file inspection
Ruby
2
star
46

bloomfilter-py

Simple and fast Bloom filter
Python
2
star
47

docker-sortdb

Docker setup for SortDB
Shell
1
star
48

qless-java

qless java binding
Java
1
star
49

zendesk-search

Search for tags and such in zendesk
JavaScript
1
star
50

deb-swift

1
star
51

fiji

Cell schemas and schema versioning for HBase
HTML
1
star
52

p5-Webservice-Followerwonk-SocialAuthority

Perl Client for The Followerwonk Social Authority API
Perl
1
star
53

qless-util-py

Utilities for use with qless-py
Python
1
star
54

process_tree_dictionary

Implements a dictionary that is scoped to a process tree for Erlang and Elixir.
Elixir
1
star
55

moz_nav

DEPRECATED. Common navigation and layout across all SEOmoz applications
Ruby
1
star
56

logtools

Stuff for reading crawler log files. Probably not of much interest to those outside of SeoMOZ.
Python
1
star