  • Stars: 2,471
  • Rank: 18,464 (Top 0.4 %)
  • Language: Python
  • License: MIT License
  • Created over 9 years ago
  • Updated 4 months ago

Repository Details

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW

datasketch: Big Data Looks Small

datasketch gives you probabilistic data structures that can process and search very large amounts of data quickly, with little loss of accuracy.

This package contains the following data sketches:

Data Sketch         Usage
MinHash             estimate Jaccard similarity and cardinality
Weighted MinHash    estimate weighted Jaccard similarity
HyperLogLog         estimate cardinality
HyperLogLog++       estimate cardinality
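
To make the MinHash row concrete, here is a minimal sketch of how a MinHash signature can estimate Jaccard similarity. It uses only the Python standard library and is not datasketch's implementation; the function names `minhash_signature` and `estimate_jaccard`, and the use of salted BLAKE2b as the hash family, are assumptions made for illustration.

```python
import hashlib


def minhash_signature(items, num_perm=128):
    """Return num_perm minimum hash values for a non-empty set of strings.

    Each "permutation" is simulated by a differently salted 64-bit hash
    (an illustrative choice, not what datasketch does internally).
    """
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(item.encode("utf8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for item in items
        ))
    return signature


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions where the two minimum values agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


if __name__ == "__main__":
    a = {"minhash", "estimates", "jaccard", "similarity", "between", "sets"}
    b = {"minhash", "estimates", "similarity", "between", "documents"}
    exact = len(a & b) / len(a | b)
    approx = estimate_jaccard(minhash_signature(a), minhash_signature(b))
    print(f"exact={exact:.3f} estimated={approx:.3f}")
```

The estimate is the share of positions where both sets produce the same minimum, so more hash functions give a tighter estimate at the cost of a longer signature.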

The following indexes for data sketches are provided to support sub-linear query time:

Index                   For Data Sketch             Supported Query Type
MinHash LSH             MinHash, Weighted MinHash   Jaccard Threshold
MinHash LSH Forest      MinHash, Weighted MinHash   Jaccard Top-K
MinHash LSH Ensemble    MinHash                     Containment Threshold
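
As a rough illustration of how a Jaccard-threshold index can avoid comparing a query against every stored sketch, the following sketch applies the standard banding technique to MinHash signatures. It is a simplified, standard-library-only example and not datasketch's MinHash LSH; the class `BandedLSH`, the helper `minhash_signature`, and the parameter values are hypothetical.

```python
import hashlib
from collections import defaultdict


def minhash_signature(items, num_perm=128):
    """Return num_perm minimum salted-hash values for a non-empty set of strings."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(
                    item.encode("utf8"), digest_size=8, salt=seed.to_bytes(8, "little")
                ).digest(),
                "little",
            )
            for item in items
        )
        for seed in range(num_perm)
    ]


class BandedLSH:
    """Toy LSH index: a signature is split into bands, each hashed to a bucket."""

    def __init__(self, num_perm=128, bands=32):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.num_perm = num_perm
        self.bands = bands
        self.rows = num_perm // bands
        # One hash table per band: band values -> keys stored in that bucket.
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        if len(signature) != self.num_perm:
            raise ValueError("signature length does not match num_perm")
        for i in range(self.bands):
            yield i, tuple(signature[i * self.rows:(i + 1) * self.rows])

    def insert(self, key, signature):
        for i, band in self._band_keys(signature):
            self.tables[i][band].add(key)

    def query(self, signature):
        """Return keys that share at least one full band with the query."""
        candidates = set()
        for i, band in self._band_keys(signature):
            candidates |= self.tables[i].get(band, set())
        return candidates


if __name__ == "__main__":
    index = BandedLSH(num_perm=128, bands=32)
    index.insert("doc1", minhash_signature({"a", "b", "c", "d"}))
    index.insert("doc2", minhash_signature({"w", "x", "y", "z"}))
    print(index.query(minhash_signature({"a", "b", "c", "d"})))
```

More bands with fewer rows each make the index return candidates at lower similarity; fewer bands with more rows raise the effective threshold. Candidates are approximate, so a caller would typically re-check them against the signatures.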

datasketch requires Python 2.7 or above, NumPy 1.11 or above, and SciPy.

Note that MinHash LSH and MinHash LSH Ensemble also support Redis and Cassandra as storage layers (see MinHash LSH at Scale).

Install

To install datasketch using pip:

pip install datasketch

This will also install NumPy as a dependency.

To install with the Redis dependency:

pip install datasketch[redis]

To install with the Cassandra dependency:

pip install datasketch[cassandra]

More Repositories

1. SetSimilaritySearch
   All-pair set similarity search on millions of sets in Python and on a laptop
   Python, 587 stars

2. lsh
   Locality Sensitive Hashing for Go (Multi-probe LSH, LSH Forest, basic LSH)
   Go, 101 stars

3. lshensemble
   LSH index for approximate set containment search
   Go, 56 stars

4. llm_maze_agent
   Navigating a maze using LLM agent
   Python, 35 stars

5. go-fasttext
   Facebook fastText database in SQLite with Go API
   Go, 32 stars

6. go-set-similarity-search
   Efficient set similarity search algorithms implemented in Go
   Go, 29 stars

7. go-sql-lsh
   Locality Sensitive Hashing using Golang and SQL database
   Go, 27 stars

8. minhash-lsh
   Minhash LSH in Golang
   Go, 25 stars

9. josie
   Code and Benchmarks for JOSIE (SIGMOD 2019)
   Go, 18 stars

10. set-similarity-search-benchmarks
    Benchmark Datasets for Set Similarity Search
    10 stars

11. go-datasketch
    Probabilistic data structures for processing very large datasets (MinHash, HyperLogLog)
    Go, 10 stars

12. planning-poker
    Planning Poker game for scrum team planning using Meteor.js
    JavaScript, 10 stars

13. xbrl2rdf
    Publishing XBRL document as RDF data
    Java, 8 stars

14. datatable
    An in-memory relational table in Go similar to C#'s System.Data.DataTable.
    Go, 8 stars

15. Stock-Portfolio-Builder
    Use financial optimization models with MATLAB
    MATLAB, 6 stars

16. WhatGPT
    A ChatGPT clone made with ChatGPT (GPT-4)
    JavaScript, 5 stars

17. counter
    A frequency counter similar to Python's collections.Counter with additional support of other statistics.
    Go, 4 stars

18. angularjs-d3-flask-demo
    Using AngularJS, d3.js and Flask to create interactive demo.
    JavaScript, 4 stars

19. secxbrl
    Download SEC XBRL Filings
    Java, 2 stars

20. chatgpt-data-analysis-examples
    Examples of using ChatGPT with Code Interpreter Plugin for data analysis
    Python, 1 star

21. nserc-subjects
    Use NSERC award application summaries to predict research subjects
    Python, 1 star

22. ekzhu.github.io
    SCSS, 1 star