  • Stars: 139
  • Rank: 261,417 (Top 6%)
  • Language: Python
  • License: Other
  • Created: about 10 years ago
  • Updated: 4 months ago

SparseLSH

A Locality Sensitive Hashing (LSH) library with an emphasis on large, high-dimensional datasets.

Features

  • Fast and memory-efficient calculations using sparse matrices.
  • Built-in support for key-value storage backends: pure-Python, Redis (memory-bound), LevelDB, BerkeleyDB
  • Support for multiple hash indexes (based on Kay Zhu's lshash)
  • Built-in support for common distance/objective functions for ranking outputs.

Details

SparseLSH is based on a fork of Kay Zhu's lshash and is suited for datasets that won't fit into main memory or are high-dimensional. Using sparse matrices allows for speedups of easily over an order of magnitude compared to using dense, list-based or numpy array-based vector math. Sparse matrices also make it possible to deal with these datasets purely in memory using Python dicts or through Redis. When this isn't appropriate, you can use one of the disk-based key-value stores, LevelDB and BerkeleyDB. Serialization is done using cPickle (for raw C speedups), falling back to Python's pickle if cPickle isn't available.
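
To make the memory point concrete, here is a small illustration (not part of SparseLSH itself) of how much smaller a mostly-zero vector is in CSR form:

import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero 10,000-dimensional vector: dense storage keeps every zero,
# CSR storage keeps only the nonzero entries plus their indices.
dense = np.zeros((1, 10000))
dense[0, :10] = 1.0
sparse = csr_matrix(dense)

print(dense.nbytes)  # 80000 bytes
print(sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes)
# a few hundred bytes at most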

Installation

The easy way:

pip install sparselsh

Or you can clone this repo and run the minimal install:

pip install .

If you would like to use the LevelDB or Redis storage backends, you can install the dependencies from the optional-requirements.txt:

pip install -r optional-requirements.txt

Quickstart

You can try LSH quickly using the bundled sparselsh command-line utility. Pass it the path to a file containing records to be clustered, one per line, and it will output groups of similar items.

sparselsh path/to/recordsfile.txt

To create 4-bit hashes for input data of 7 dimensions:

from sparselsh import LSH
from scipy.sparse import csr_matrix

X = csr_matrix([
    [3, 0, 0, 0, 0, 0, -1],
    [0, 1, 0, 0, 0, 0,  1],
    [1, 1, 1, 1, 1, 1,  1]
])

# One label for each input point
y = ["label-one", "second", "last"]

lsh = LSH(
    4,
    X.shape[1],
    num_hashtables=1,
    storage_config={"dict":None}
)

lsh.index(X, extra_data=y)

# Build a 1-D (single row) sparse matrix
X_sim = csr_matrix([[1, 1, 1, 1, 1, 1, 0]])
# find the point in X nearest to X_sim
points = lsh.query(X_sim, num_results=1)
# split up the first result into its parts
(point, label), dist = points[0]
print(label)  # 'last'

The query above results in a list of tuples, each pairing a (matrix, label) tuple with a distance score; a lower score means a closer match (here the score is 1.0). Here's a breakdown of the return value when query is called with a single input row (a 1-row sparse matrix; the output is different when passing multiple query points):

[((<1x7 sparse matrix of type '<type 'numpy.int64'>' with 7 stored elements in Compressed Sparse Row format>, 'label'), 1.0)]

We can look at the most similar matched item by accessing the sparse matrix and invoking its todense method:

#                      ,------- Get first result-score tuple
#                     |   ,---- Get data. [1] is distance score
#                     |  |  ,-- Get the point. [1] is extra_data
#                     |  |  |
In [11]: print(points[0][0][0].todense())
[[1 1 1 1 1 1 1]]

You can also pass multiple records to query by increasing the number of rows in the matrix passed to query. This adds one extra dimension to the output, one entry per query row. (NOTE: when the input is a single sparse row, this extra dimension won't be added.) Here's the output when query is passed a 2-row input:

[
  [((<1x7 sparse matrix ...>, 'label'), 1.0)],
  [((<1x7 sparse matrix ...>, 'extra'), 0.8),
   ((<1x7 sparse matrix ...>, 'another'), 0.3)]
]

Here you can see the extra dimension, one entry for each query point passed to query. The data structure for each query point's result is the same as the single-row output above.
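
Continuing from the quickstart above, a minimal sketch of a batched query and how to unpack it (lsh and csr_matrix come from the quickstart; the labels are the ones indexed there):

# Two query points stacked into a single 2x7 sparse matrix
X_batch = csr_matrix([
    [1, 1, 1, 1, 1, 1, 0],
    [3, 0, 0, 0, 0, 0, -1]
])

results = lsh.query(X_batch, num_results=1)
# One list of result tuples per query row
for row_results in results:
    for (point, label), dist in row_results:
        print(label, dist)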

Main Interface

Most of the parameters are supplied at class init time:

LSH( hash_size,
     input_dim,
     num_hashtables=1,
     storage_config=None,
     matrices_filename=None,
     overwrite=False)

Parameters:

hash_size:
    The length of the resulting binary hash. This controls how many "buckets"
    there will be for items to be sorted into.

input_dim:
    The dimension of the input vector. This needs to be the same as the input
    points.

num_hashtables = 1:
    (optional) The number of hash tables used. More hash tables increase the
    probability of hash collisions, making it more likely that similar items
    are found for a query item. Increase this to get better accuracy at the
    expense of increased memory usage.

storage_config = None:
    (optional) A dict naming the storage backend and its configuration
    options. The following storage backends are supported with the following
    configurations (see the example after this parameter list):
        In-Memory Python Dictionary:
            {"dict": None} # Takes no options
        Redis:
            {"redis": {"host": "127.0.0.1", "port": 6379, "db": 0}}
        LevelDB:
            {"leveldb": {"db": "ldb"}}
            Where "ldb" specifies the directory to store the LevelDB database.
            (In this case it will be `./ldb/`)
        Berkeley DB:
            {"berkeleydb": {"filename": "./db"}}
            Where "filename" is the location of the database file.

matrices_filename = None:
    (optional) The path to the .npz file where the random matrices are
    stored, or will be stored if the file does not exist yet. If you change
    the input dimension or the number of hash tables, you'll need to set the
    following option, overwrite, to True, or delete this file.

overwrite = False:
    (optional) Whether to overwrite the matrices file if it already exists.
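
For example, here is a sketch of constructing an LSH instance with an on-disk LevelDB backend and persisted random matrices (the path, filename, and sizes are illustrative, and the LevelDB backend requires the optional dependencies mentioned under Installation):

from sparselsh import LSH

# 8-bit hashes over 100-dimensional inputs, two hash tables,
# buckets stored under ./ldb/, random planes cached in planes.npz
lsh = LSH(
    8,
    100,
    num_hashtables=2,
    storage_config={"leveldb": {"db": "ldb"}},
    matrices_filename="planes.npz",
    overwrite=False
)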

Index (Add points to hash table):

  • To index a data point with a given LSH instance:

    lsh.index(input_point, extra_data=None)

Parameters:

input_point:
    A sparse CSR matrix with input_dim columns. Each row is indexed as a
    separate point.

extra_data = None:
    (optional) Extra data to be added along with the input_point.
    This can be used to hold values like class labels, URIs, titles, etc.

This function returns nothing.
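
For example, adding one more point to the lsh instance from the quickstart (the label is illustrative):

# extra_data entries line up one-to-one with the rows being indexed
X_more = csr_matrix([[0, 2, 0, 0, 1, 0, 0]])
lsh.index(X_more, extra_data=["fourth"])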

Query (Search for similar points)

To query a data point against a given LSH instance:

lsh.query(query_point, num_results=None, distance_func="euclidean")

Parameters:

query_point:
    The query data point is a sparse CSR matrix.

num_results = None:
    (optional) Integer specifying the maximum number of results to return.
    If not specified, all candidates will be returned as a list in ranked
    order.
    NOTE: You do not save processing by limiting the results. Currently,
    a similarity ranking and sort is done on all items in the hashtable
    regardless of this parameter.

distance_func = "euclidean":
    (optional) Distance function to use to rank the candidates. By default
    euclidean distance function will be used.

Returns a list of tuples, each containing the original indexed point (a (csr-matrix, extra_data) tuple, or just the csr-matrix if no extra data was supplied) and a distance score.
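
Putting it together, a short sketch (reusing X_sim and lsh from the quickstart) of ranking and unpacking the two closest matches when extra data was supplied at index time:

# Rank all candidates by Euclidean distance, keep the two closest
for (point, extra), score in lsh.query(X_sim, num_results=2,
                                       distance_func="euclidean"):
    print(extra, score, point.todense())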

More Repositories

  • BitcoinTradingAlgorithmToolkit (Python, 177 stars): A framework for logging, simulating, and analyzing prices of currencies on various exchanges using technical analysis, fuzzy logic, and neural networks.
  • chatgpt-document-extraction (Python, 119 stars): A proof-of-concept tool for using ChatGPT to transform messy text documents into structured JSON.
  • autoscrape-py (HTML, 103 stars): An automated, programming-free web scraper for interactive sites.
  • sentence-autosegmentation (Python, 37 stars): Deep-learning based sentence auto-segmentation from unstructured text without punctuation.
  • artificial_seinfeld (Python, 14 stars): Tools for generating artificial Seinfeld episodes using deep learning ... very serious project.
  • reason-act-sqlite-py (Python, 10 stars): A demonstration of using reason-and-act with llama.cpp and an LLM to pose plain-English queries to a SQLite database.
  • llm-document-extraction (Python, 10 stars): A proof-of-concept tool for using local LLMs to transform messy text documents into structured JSON.
  • haunted_house_disassembly (Assembly, 7 stars): Atari 2600 MOS 6502/7 commented disassembly of the game Haunted House.
  • bitcore-namecoin (JavaScript, 6 stars): Namecoin support for bitcore.
  • ref-extract (Python, 5 stars): Reference extraction from text data (with inaccuracy support).
  • virtualcurrency-trading-alerts (Python, 5 stars): An alert system for when an event (volume, price, etc.) on a Bitcoin exchange hits a threshold in a specified time frame.
  • namecoin-testnet-box (Makefile, 3 stars): A private namecoin testnet based on Namecoin Core.
  • nicar2022-db-optimization (JavaScript, 3 stars): A presentation for my NICAR 2022 Database Optimization class.
  • AustinMunicipalCourtScraper (Python, 3 stars): Takes a person's last name and DOB, grabs their Austin Municipal Court case history, and writes it to a CSV file.
  • hextractor (JavaScript, 2 stars): Workbench module for Hext extraction of data.
  • datasette-shorturl (HTML, 2 stars): A Datasette plugin that provides short URLs for your queries.
  • Austin-Traffic-Scraper (PHP, 2 stars): Austin-Travis County traffic report page scraper.
  • rescue-me (JavaScript, 2 stars): A webapp that makes linking animal rescues with potential adopters simple and painless.
  • what-they-said (JavaScript, 1 star): An interactive tool for searching hundreds of hours of 2016 campaign speeches.
  • page-change-monitor (JavaScript, 1 star): A simple all-in-one tool to monitor a set of web pages for changes.
  • autoscrape-www (JavaScript, 1 star): A frontend for driving AutoScrape via a web browser.
  • parse-tx-cfr (Clojure, 1 star): An experimental parser for Texas-style Campaign Finance Reports.
  • APDIncidentReportsScraper (Python, 1 star): Scrapes the Austin Police Department's messy Incident Reports Database (police reports) into a machine-readable CSV format.
  • python-rss2irc (Python, 1 star): Forked version of gehaxelt/python-rss2irc with arXiv and other improvements (RSS-to-IRC bot).
  • AC-Crime-Visualization (JavaScript, 1 star): A d3.js visualization of crime and sentence length by race/ethnicity.
  • tabula-draw-columns (HTML, 1 star): Simple tool to visually build column config strings for tabula-java.