• Stars
    star
    587
  • Rank 75,636 (Top 2 %)
  • Language
    Python
  • License
    Apache License 2.0
  • Created almost 6 years ago
  • Updated almost 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

All-pair set similarity search on millions of sets in Python and on a laptop

Set Similarity Search

Python package

Efficient set similarity search algorithms in Python. For even better performance see the Go Implementation.

What is set similarity search?

Let's say we have a database of users and the books they have read. Assume that we want to recommend "friends" for each user, and the "friends" must have read very similar set of books as the user have. We can model this as a set similarity search problem, by representing each user's books as a set:

Alice: {"Anna Karenina", "War and Peace", "The Chameleon", ...}
Bob: {"Lolita", "The Metamorphosis", "The Judgement", ...}
Joey: {"Anna Karenina", "The Chameleon" ...}

A popular way to measure the similarity between two sets is Jaccard similarity, which gives a fractional score between 0 and 1.0.

There are two versions of set similarity search problem, both can be defined given a collection of sets, a similarity function and a threshold:

  1. All-Pairs: find all pairs of sets that have similarities greater than (or equal to) the threshold;
  2. Query: given a query set, from the collection of sets, find all that have similarities greater than (or equal to) the threshold with respect to the query set.

Both versions of the problem can be very computationally expensive as the collection can be large and the set sizes can be large. The simple brute-force algorithm is O(n^2) for (1) and O(n) for (2).

This package includes a Python implementation of the "All-Pair-Binary" algorithm in Scaling Up All Pairs Similarity Search paper, with additional position filter optimization. This algorithm still has the same worst-case complexity as the brute-force algorithm, however, by taking advantage of skewness in empirical distributions of set sizes and frequencies, it often runs much faster (even better than MinHash LSH).

Benchmarks

Run All-Pairs on 3.5 GHz Intel Core i7, using similarity function jaccard and similarity threshold 0.5. The running time of datasketch.MinHashLSH is also shown below for comparison (num_perm=32).

Dataset Input Sets Avg. Size SetSimilaritySearch Runtime datasketch Runtime datasketch Accuracy
Pokec social network (relationships): from-nodes are set IDs; to-nodes are elements 1432693 27.31 10m49s 11m4s Precision: 0.73; Recall: 0.67
LiveJournal: from-nodes are set IDs; to-nodes are elements 4308452 16.01 28m51s 31m58s Precision: 0.79; Recall: 0.74

Although datasketch.MinHashLSH is an approximate algorithm, and I am using num_perm=32 which is quite low, it is still a bit slower than the exact algorithm SetSimilaritySearch. The time for creating datasketch.MinHash is also included in the end-to-end time, while in practice this time can be saved through pre-computation. However, for ad hoc computation of All-Pairs, SetSimilaritySearch is still the better choice, especially when sets are small and fit in memory.

Run Query on 3.5 GHz Intel Core i7, using similarity function jaccard and similarity threshold 0.5. The query sets are sampled from the dataset itself. The running time of datasketch.MinHashLSH is also shown below for comparison (num_perm=32).

Dataset Indexed Sets Query Sets Avg. Size SetSimilaritySearch Indexing & Querying Time datasketch Indexing & Querying Time datasketch Accuracy
Pokec social network (relationships): from-nodes are set IDs; to-nodes are elements 1432693 10k 27.31 Indexing: 1m7s; Querying (90pct): 2.3ms Indexing: 9m23s; Querying (90pct): 0.72ms Precision: 0.90; Recall: 0.88
LiveJournal: from-nodes are set IDs; to-nodes are elements 4308452 10k 16.01 Indexing: 2m32s; Querying (90pct): 1.6ms Indexing: 30m58s; Querying (90pct): 2.1ms Precision: 0.85; Recall: 0.78

The indexing time for datasketch.MinHashLSH, including the time for creating datasketch.MinHash, is much worse than SetSimilaritySearch -- nearly 10x and 15x. Therefore SetSimilaritySearch is much better for ad hoc computation of the Query problem. For the scenario in which the same search index is reused for many Query problems, datasketch.MinHashLSH is faster than SetSimilaritySearch when the set sizes are large. This is easy to understand: the size of datasketch.MinHash is constant, wheres a set can be arbitrarily large, so the query time for large sets is faster when sketch is used. However, when the set sizes become smaller, the sketch looses its advantage.

Install

Pip

pip install -U SetSimilaritySearch

Conda

conda install -c conda-forge setsimilaritysearch

Library usage

For All-Pairs, it takes an input of a list of sets, and output pairs that meet the similarity threshold.

from SetSimilaritySearch import all_pairs

# The input sets must be a Python list of iterables (i.e., lists or sets).
sets = [[1,2,3], [3,4,5], [2,3,4], [5,6,7]]
# all_pairs returns an iterable of tuples.
pairs = all_pairs(sets, similarity_func_name="jaccard", 
        similarity_threshold=0.1)
list(pairs)
# [(1, 0, 0.2), (2, 0, 0.5), (2, 1, 0.5), (3, 1, 0.2)]
# Each tuple is (<index of the first set>, <index of the second set>, <similarity>).
# The indexes are the list indexes of the input sets.

For Query, it takes an input of a list of sets, and builds a search index that can compute any number of queries. Currently the search index only supports a static collection of sets with no updates.

from SetSimilaritySearch import SearchIndex

# The input sets must be a Python list of iterables (i.e., lists or sets).
sets = [[1,2,3], [3,4,5], [2,3,4], [5,6,7]]
# The search index cannot be updated.
index = SearchIndex(sets, similarity_func_name="jaccard", 
    similarity_threshold=0.1)
# The query function takes input a set.
results = index.query([5,3,4])
results
# [(1, 1.0), (0, 0.2), (2, 0.5), (3, 0.2)]
# Each tuple is (<index of the found set>, <similarity>).
# The index is the list index of the sets in the search index.

Supported similarity functions (more to come):

  • Jaccard: intersection size divided by union size; set similarity_func_name="jaccard".
  • Cosine: intersection size divided by square root of the product of sizes; set similarity_func_name="cosine".
  • Containment: intersection size divided by the size of the first set (or query set); set similarity_func_name="containment".

Command line usage

You can also use the command line program all_pairs.py. The input must be one or two files with each line a unique SetID Token tuple. For example:

# Line starts with # will be ignored.
# Each line is <Set ID> <Token (i.e. Set Element)>, separate by a whitespace or tab.
# Every line must be unique.
1 a
1 b
1 c
1 d
2 a
2 b
2 c
3 d
3 e

When one input file is given, it computes All-Pairs; when two input files are given, it computes Query by building a search index on the first collection and querying with sets from the second collection -- effectively computes cross-collection pairs.

Example usage (All-Pairs):

all_pairs.py --input-sets testdata/example_input.txt \
    --output-pairs testdata/example_output.txt \
    --similarity-func jaccard \
    --similarity-threshold 0.1

More Repositories

1

datasketch

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW
Python
2,471
star
2

lsh

Locality Sensitive Hashing for Go (Multi-probe LSH, LSH Forest, basic LSH)
Go
101
star
3

lshensemble

LSH index for approximate set containment search
Go
56
star
4

llm_maze_agent

Navigating a maze using LLM agent
Python
35
star
5

go-fasttext

Facebook fastText database in SQLite with Go API
Go
32
star
6

go-set-similarity-search

Efficient set similarity search algorithms implemented in Go
Go
29
star
7

go-sql-lsh

Locality Sensitive Hashing using Golang and SQL database
Go
27
star
8

minhash-lsh

Minhash LSH in Golang
Go
25
star
9

josie

Code and Benchmarks for JOSIE (SIGMOD 2019)
Go
18
star
10

set-similarity-search-benchmarks

Benchmark Datasets for Set Similarity Search
10
star
11

go-datasketch

Probabilistic data structures for processing very large datasets (MinHash, HyperLogLog)
Go
10
star
12

planning-poker

Planning Poker game for scrum team planning using Meteor.js
JavaScript
10
star
13

xbrl2rdf

Publishing XBRL document as RDF data
Java
8
star
14

datatable

An in-memory relational table in Go similar to C#'s System.Data.DataTable.
Go
8
star
15

Stock-Portfolio-Builder

Use financial optimization models with MATLAB
MATLAB
6
star
16

WhatGPT

A ChatGPT clone made with ChatGPT (GPT-4)
JavaScript
5
star
17

counter

A frequency counter similar to Python's collections.Counter with additional support of other statistics.
Go
4
star
18

angularjs-d3-flask-demo

Using AngularJS, d3.js and Flask to create interactive demo.
JavaScript
4
star
19

secxbrl

Download SEC XBRL Filings
Java
2
star
20

chatgpt-data-analysis-examples

Examples of using ChatGPT with Code Interpreter Plugin for data analysis
Python
1
star
21

nserc-subjects

Use NSERC award application summaries to predict research subjects
Python
1
star
22

ekzhu.github.io

SCSS
1
star