ekzhu/datasketch

Stars
2,471
Rank 18,602 (Top 0.4 %)
Language
Python
License
MIT License
Created over 9 years ago
Updated 6 months ago

MinHash, LSH, LSH Forest, Weighted MinHash, HyperLogLog, HyperLogLog++, LSH Ensemble and HNSW

datasketch: Big Data Looks Small

datasketch gives you probabilistic data structures that can process and search very large amounts of data quickly, with little loss of accuracy.

This package contains the following data sketches:

Data Sketch	Usage
MinHash	estimate Jaccard similarity and cardinality
Weighted MinHash	estimate weighted Jaccard similarity
HyperLogLog	estimate cardinality
HyperLogLog++	estimate cardinality
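
To make the MinHash row in the table above more concrete, here is a minimal standard-library sketch of how a MinHash signature can estimate Jaccard similarity. It is an illustration of the general technique, not datasketch's implementation or API: the function names (`minhash_signature`, `estimate_jaccard`, `exact_jaccard`) are hypothetical, and seeded BLAKE2b hashes are assumed as a stand-in for random permutations.

```python
import hashlib


def _seeded_hash(token: str, seed: int) -> int:
    """Return a 64-bit hash of `token`; each seed acts like one permutation."""
    digest = hashlib.blake2b(
        token.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_perm=128):
    """Keep, for each seeded hash function, the minimum hash over the set."""
    tokens = set(tokens)
    if not tokens:
        raise ValueError("cannot sketch an empty set")
    return [min(_seeded_hash(t, seed) for t in tokens) for seed in range(num_perm)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; this estimates Jaccard."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Two overlapping example sets (made-up data for illustration only).
set_a = {f"w{i}" for i in range(100)}
set_b = {f"w{i}" for i in range(50, 150)}

estimate = estimate_jaccard(minhash_signature(set_a), minhash_signature(set_b))
exact = exact_jaccard(set_a, set_b)
```

The signature has a fixed length regardless of how large the input sets are, which is what makes comparisons cheap. A longer signature (larger `num_perm`) reduces the estimation error at the cost of more hashing.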

The following indexes for data sketches are provided to support sub-linear query time:

Index	For Data Sketch	Supported Query Type
MinHash LSH	MinHash, Weighted MinHash	Jaccard Threshold
MinHash LSH Forest	MinHash, Weighted MinHash	Jaccard Top-K
MinHash LSH Ensemble	MinHash	Containment Threshold
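
The sub-linear query time of a Jaccard-threshold index can be sketched with the classic banding scheme: each signature is split into bands, and two sets become candidates if any band matches exactly. The code below is a simplified standard-library illustration of that idea, not datasketch's MinHash LSH class. `BandedLSH`, its `bands` parameter, and the helper `minhash_signature` are hypothetical names chosen for this example.

```python
import hashlib
from collections import defaultdict


def minhash_signature(tokens, num_perm=128):
    """MinHash signature using seeded BLAKE2b hashes (an assumed hash family)."""
    tokens = set(tokens)
    if not tokens:
        raise ValueError("cannot sketch an empty set")

    def h(token, seed):
        digest = hashlib.blake2b(
            token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little")

    return [min(h(t, seed) for t in tokens) for seed in range(num_perm)]


class BandedLSH:
    """Toy LSH index: one hash table per band of the signature."""

    def __init__(self, num_perm=128, bands=32):
        if num_perm % bands != 0:
            raise ValueError("num_perm must be divisible by bands")
        self.num_perm = num_perm
        self.bands = bands
        self.rows = num_perm // bands
        self.tables = [defaultdict(set) for _ in range(bands)]

    def _band_keys(self, signature):
        if len(signature) != self.num_perm:
            raise ValueError("signature length does not match num_perm")
        for i in range(self.bands):
            yield i, tuple(signature[i * self.rows:(i + 1) * self.rows])

    def insert(self, key, signature):
        """Store `key` in the bucket of every band of its signature."""
        for i, band in self._band_keys(signature):
            self.tables[i][band].add(key)

    def query(self, signature):
        """Return keys sharing at least one identical band (candidates only)."""
        candidates = set()
        for i, band in self._band_keys(signature):
            candidates |= self.tables[i].get(band, set())
        return candidates


# Made-up example data: "b" overlaps heavily with "a"; "c" is disjoint.
docs = {
    "a": {f"w{i}" for i in range(20)},
    "b": {f"w{i}" for i in range(18)} | {"x1", "x2"},
    "c": {f"z{i}" for i in range(20)},
}
index = BandedLSH(num_perm=128, bands=32)
for name, tokens in docs.items():
    index.insert(name, minhash_signature(tokens))

candidates = index.query(minhash_signature(docs["a"]))
```

A query touches only one bucket per band rather than every stored signature. The number of bands and rows per band controls which similarity level is likely to produce a match: more bands with fewer rows admits lower-similarity candidates. Results are candidates and may include false positives, so a caller wanting exact answers would verify them against the original sets.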

datasketch requires Python 2.7 or above, NumPy 1.11 or above, and SciPy.

Note that MinHash LSH and MinHash LSH Ensemble also support Redis and Cassandra storage layers (see MinHash LSH at Scale).

Install

To install datasketch using pip:

pip install datasketch

This will also install NumPy as a dependency.

To install with the Redis dependency:

pip install datasketch[redis]

To install with the Cassandra dependency:

pip install datasketch[cassandra]

SetSimilaritySearch

All-pair set similarity search on millions of sets in Python and on a laptop

lsh

Locality Sensitive Hashing for Go (Multi-probe LSH, LSH Forest, basic LSH)

lshensemble

LSH index for approximate set containment search

llm_maze_agent

Navigating a maze using LLM agent

go-fasttext

Facebook fastText database in SQLite with Go API

go-set-similarity-search

Efficient set similarity search algorithms implemented in Go

go-sql-lsh

Locality Sensitive Hashing using Golang and SQL database

minhash-lsh

Minhash LSH in Golang

josie

Code and Benchmarks for JOSIE (SIGMOD 2019)

set-similarity-search-benchmarks

Benchmark Datasets for Set Similarity Search

go-datasketch

Probabilistic data structures for processing very large datasets (MinHash, HyperLogLog)

planning-poker

Planning Poker game for scrum team planning using Meteor.js

datatable

An in-memory relational table in Go similar to C#'s System.Data.DataTable.

xbrl2rdf

Publishing XBRL document as RDF data

Stock-Portfolio-Builder

Use financial optimization models with MATLAB

WhatGPT

A ChatGPT clone made with ChatGPT (GPT-4)

counter

A frequency counter similar to Python's collections.Counter with additional support of other statistics.

angularjs-d3-flask-demo

Using AngularJS, d3.js and Flask to create interactive demo.

chatgpt-data-analysis-examples

Examples of using ChatGPT with Code Interpreter Plugin for data analysis

secxbrl

Download SEC XBRL Filings

nserc-subjects

Use NSERC award application summaries to predict research subjects

ekzhu.github.io