• Stars
    star
    339
  • Rank 124,632 (Top 3 %)
  • Language
    Rust
  • Created about 1 year ago
  • Updated about 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

The Fast Vector Similarity Library is designed to provide efficient computation of various similarity measures between vectors.

Fast Vector Similarity Library

Introduction

The Fast Vector Similarity Library is designed to provide efficient computation of various similarity measures between vectors. It is suitable for tasks such as data analysis, machine learning, and statistics, where measuring the relationship between vectors is essential. Written in Rust, the library offers a high-performance solution and can be easily integrated with Python through provided bindings.

Features

Similarity Measures

The library implements several popular similarity measures, including:

  1. Spearman's Rank-Order Correlation (spearman_rho)
  2. Kendall's Tau Rank Correlation (kendall_tau)
  3. Approximate Distance Correlation (approximate_distance_correlation)
  4. Jensen-Shannon Similarity (jensen_shannon_similarity)
  5. Hoeffding's D Measure (hoeffding_d)

Bootstrapping Technique

The library supports bootstrapping for robust similarity computation. This technique involves randomly resampling the dataset to estimate the distribution of the similarity measures, providing more confidence in the results.

Performance Optimizations

To achieve high efficiency, the library leverages:

  • Parallel Computing: Using the rayon crate, computations are parallelized across available CPU cores.
  • Vectorized Operations: The library leverages efficient vectorized operations provided by the ndarray crate.

Python Bindings

The library includes Python bindings to enable seamless integration with Python code. Functions py_compute_vector_similarity_stats and py_compute_bootstrapped_similarity_stats are exposed, allowing you to compute similarity statistics and bootstrapped similarity statistics, respectively.

Installation

Rust

Include the library in your Rust project by adding it to your Cargo.toml file.

Python

You can install the Python bindings for the Fast Vector Similarity Library directly from PyPI using the following command:

pip install fast_vector_similarity

This command will download and install the package, making it available for use in your Python projects.

Use with Text Embedding Vectors from LLMs

The Python demo code below introduces additional functionality that allows the Fast Vector Similarity Library to easily work with text embedding vectors from Language Models like LLMs (e.g., Llama2), such as those generated by my other library, the Llama2 Embeddings FastAPI Service (particularly, the output of the /get_all_embedding_vectors_for_document endpoint):

1. Converting Text Embeddings to Pandas DataFrame

The function convert_embedding_json_to_pandas_df reads text embeddings from a JSON file and converts them into a Pandas DataFrame. Each embedding is associated with a specific text (e.g., a sentence from Shakespeare's Sonnets), and the DataFrame structure facilitates manipulation and analysis of these embeddings.

2. Applying Fast Vector Similarity to Text Embeddings

The function apply_fvs_to_vector takes a row of embeddings and a query embedding, applies the chosen similarity measures, and returns the results as a JSON object. This function leverages the core similarity computation functionality provided by the Fast Vector Similarity Library, allowing it to be used directly on text embeddings.

3. Comparing Embeddings with Large Datasets

The main section of the code includes an example where text embeddings from Llama2 are loaded and analyzed. This part of the code demonstrates how to:

  • Select a random query embedding from the dataset.
  • Compute similarity measures between the query embedding and the rest of the dataset.
  • Create a DataFrame to hold the similarity results.
  • Sort and display the top 10 most similar embeddings by Hoeffding's D.

4. Compatibility with Higher-Dimensional Embeddings

The example works with 4096-dimensional vectors, showcasing the library's ability to handle high-dimensional data typical of modern language models. This compatibility ensures that the library can be used with a wide range of language models and text embedding techniques. In fact, the library can easily handle even larger dimensions.

Example Python Code

import time
import numpy as np
import json
import pandas as pd
from random import choice
import fast_vector_similarity as fvs

def convert_embedding_json_to_pandas_df(file_path):
    # Read the JSON file
    with open(file_path, 'r') as file:
        data = json.load(file)
    # Extract the text and embeddings
    texts = [item['text'] for item in data]
    embeddings = [item['embedding'] for item in data]
    # Determine the total number of vectors and the dimensions of each vector
    total_vectors = len(embeddings)
    vector_dimensions = len(embeddings[0]) if total_vectors > 0 else 0
    # Print the total number of vectors and dimensions
    print(f"Total number of vectors: {total_vectors}")
    print(f"Dimensions of each vector: {vector_dimensions}")
    # Convert the embeddings into a DataFrame
    df = pd.DataFrame(embeddings, index=texts)
    return df

def apply_fvs_to_vector(row_embedding, query_embedding):
    params = {
        "vector_1": query_embedding.tolist(),
        "vector_2": row_embedding.tolist(),
        "similarity_measure": "all"
    }
    similarity_stats_str = fvs.py_compute_vector_similarity_stats(json.dumps(params))
    return json.loads(similarity_stats_str)

def main():
    length_of_test_vectors = 15000
    print(f"Generating 2 test vectors of length {length_of_test_vectors}...")
    vector_1 = np.linspace(0., length_of_test_vectors - 1, length_of_test_vectors)
    vector_2 = vector_1 ** 0.2 + np.random.rand(length_of_test_vectors)
    print("Generated vector_1 using linear spacing and vector_2 using vector_1 with a power of 0.2 and some random noise.\n")

    similarity_measure = "all" # Or specify a particular measure
    params = {
        "vector_1": vector_1.tolist(),
        "vector_2": vector_2.tolist(),
        "similarity_measure": similarity_measure
    }
    
    # Time the exact similarity calculation
    print("Computing Exact Similarity Measures...")
    start_time_exact = time.time()
    similarity_stats_str = fvs.py_compute_vector_similarity_stats(json.dumps(params))
    similarity_stats_json = json.loads(similarity_stats_str)
    elapsed_time_exact = time.time() - start_time_exact
    print(f"Time taken for exact calculation: {elapsed_time_exact:.5f} seconds")

    # Print results
    print("_______________________________________________________________________________________________________________________________________________\n")
    print("Spearman's rho:", similarity_stats_json["spearman_rho"])
    print("Kendall's tau:", similarity_stats_json["kendall_tau"])
    print("Distance Correlation:", similarity_stats_json["approximate_distance_correlation"])
    print("Jensen-Shannon Similarity:", similarity_stats_json["jensen_shannon_similarity"])
    print("Hoeffding's D:", similarity_stats_json["hoeffding_d"])
    print("_______________________________________________________________________________________________________________________________________________\n")

    # Bootstrapped calculations
    number_of_bootstraps = 2000
    n = 15
    sample_size = int(length_of_test_vectors / n)

    print(f"Computing Bootstrapped Similarity Measures with {number_of_bootstraps} bootstraps and a sample size of {sample_size}...")
    start_time_bootstrapped = time.time()
    params_bootstrapped = {
        "x": vector_1.tolist(),
        "y": vector_2.tolist(),
        "sample_size": sample_size,
        "number_of_bootstraps": number_of_bootstraps,
        "similarity_measure": similarity_measure
    }
    bootstrapped_similarity_stats_str = fvs.py_compute_bootstrapped_similarity_stats(json.dumps(params_bootstrapped))
    bootstrapped_similarity_stats_json = json.loads(bootstrapped_similarity_stats_str)
    elapsed_time_bootstrapped = time.time() - start_time_bootstrapped
    print(f"Time taken for bootstrapped calculation: {elapsed_time_bootstrapped:.5f} seconds")

    time_difference = abs(elapsed_time_exact - elapsed_time_bootstrapped)
    print(f"Time difference between exact and robust bootstrapped calculations: {time_difference:.5f} seconds")
    
    # Print bootstrapped results
    print("_______________________________________________________________________________________________________________________________________________\n")
    print("Number of Bootstrap Iterations:", bootstrapped_similarity_stats_json["number_of_bootstraps"])
    print("Bootstrap Sample Size:", bootstrapped_similarity_stats_json["sample_size"])
    print("\nRobust Spearman's rho:", bootstrapped_similarity_stats_json["spearman_rho"])
    print("Robust Kendall's tau:", bootstrapped_similarity_stats_json["kendall_tau"])
    print("Robust Distance Correlation:", bootstrapped_similarity_stats_json["approximate_distance_correlation"])
    print("Robust Jensen-Shannon Similarity:", bootstrapped_similarity_stats_json["jensen_shannon_similarity"])
    print("Robust Hoeffding's D:", bootstrapped_similarity_stats_json["hoeffding_d"])
    print("_______________________________________________________________________________________________________________________________________________\n")

    # Compute the differences between exact and bootstrapped results
    measures = ["spearman_rho", "kendall_tau", "approximate_distance_correlation", "jensen_shannon_similarity", "hoeffding_d"]
    for measure in measures:
        exact_value = similarity_stats_json[measure]
        bootstrapped_value = bootstrapped_similarity_stats_json[measure]
        absolute_difference = abs(exact_value - bootstrapped_value)
        percentage_difference = (absolute_difference / exact_value) * 100

        print(f"\nDifference between exact and bootstrapped {measure}: {absolute_difference}")
        print(f"Difference as % of the exact value: {percentage_difference:.2f}%")

    print("Now testing with a larger dataset, using sentence embedddings from Llama2 (4096-dimensional vectors) on some Shakespeare Sonnets...")
    # Load the embeddings into a DataFrame
    input_file_path = "sample_input_files/Shakespeare_Sonnets_small.json"
    embeddings_df = convert_embedding_json_to_pandas_df(input_file_path)
    
    # Select a random row for the query embedding
    query_embedding_index = choice(embeddings_df.index)
    query_embedding = embeddings_df.loc[query_embedding_index]
    print(f"Selected query embedding for sentence: `{query_embedding_index}`")

    # Remove the selected row from the DataFrame
    embeddings_df = embeddings_df.drop(index=query_embedding_index)

    # Apply the function to each row of embeddings_df
    json_outputs = embeddings_df.apply(lambda row: apply_fvs_to_vector(row, query_embedding), axis=1)

    # Create a DataFrame from the list of JSON outputs
    vector_similarity_results_df = pd.DataFrame.from_records(json_outputs)
    vector_similarity_results_df.index = embeddings_df.index

    # Add the required columns to the DataFrame
    columns = ["spearman_rho", "kendall_tau", "approximate_distance_correlation", "jensen_shannon_similarity", "hoeffding_d"]
    vector_similarity_results_df = vector_similarity_results_df[columns]
    
    # Sort the DataFrame by the hoeffding_d column in descending order
    vector_similarity_results_df = vector_similarity_results_df.sort_values(by="hoeffding_d", ascending=False)
    
    print("\nTop 10 most similar embedding results by Hoeffding's D:")
    print(vector_similarity_results_df.head(10))

Screenshot of Output:

image

Usage

In Rust

You can utilize the core functionality in Rust by calling functions like compute_vector_similarity_stats and compute_bootstrapped_similarity_stats with appropriate parameters.

In Python

Install the provided Python package and use the functions as demonstrated in the example above.

About the Different Similarity Measures

  1. Spearman's Rank-Order Correlation (spearman_rho): Spearman's rho assesses the strength and direction of the monotonic relationship between two ranked variables. Unlike Pearson's correlation, it does not assume linearity and is less sensitive to outliers, making it suitable for non-linear relationships.

  2. Kendall's Tau Rank Correlation (kendall_tau): Kendall's tau measures the ordinal association between two variables. It's valuable for its ability to handle ties and its interpretability as a probability. Unlike other correlation measures, it's based on the difference between concordant and discordant pairs, making it robust and versatile.

  3. Approximate Distance Correlation (approximate_distance_correlation): Distance correlation quantifies both linear and non-linear dependencies between variables. It has the powerful property of being zero if and only if the variables are independent. This makes it a more comprehensive measure of association than traditional correlation metrics.

  4. Jensen-Shannon Similarity (jensen_shannon_similarity): Jensen-Shannon Similarity is derived from the Jensen-Shannon Divergence, a symmetrical and smoothed version of the Kullback-Leibler divergence. It quantifies the similarity between two probability distributions and is especially useful in comparing distributions that may have non-overlapping support.

  5. Hoeffding's D Measure (hoeffding_d): Hoeffding's D is a non-parametric measure that detects complex non-linear relationships between variables. The D statistic is robust against various alternatives to independence, including non-monotonic relationships. It's particularly useful when the nature of the relationship between variables is unknown or unconventional.

Each of these measures has unique properties and applicability, providing a comprehensive toolkit for understanding the relationships between variables in different contexts. By including both classical and more specialized measures, the library offers flexibility and depth in analyzing vector similarities.

Bootstrapping Technique for Robust Estimators

Bootstrapping is a powerful statistical method that allows us to estimate the distribution of a statistic by repeatedly resampling with replacement from the observed data. In the context of the Fast Vector Similarity Library, bootstrapping is employed to obtain robust estimators of similarity measures between vectors. Here's how it works:

Process

  1. Random Subset Selection: For each bootstrap iteration, a random subset of indices is selected from the original vectors. These indices are used to create resampled vectors that preserve the original data's structure and relationships.

  2. Compute Similarity: The selected random subset of indices is used to compute the similarity between the resampled vectors according to the chosen similarity measure. This process is repeated many times, resulting in a distribution of similarity estimates.

  3. Robust Averaging: To obtain a robust estimator that is less sensitive to outliers, the interquartile range (IQR) of the distribution of similarity estimates is considered. By keeping only the values within the IQR, the influence of extreme values is minimized. A robust average of the values within this range provides the final robust similarity estimate.

Why It's Useful

The bootstrapping technique offers several advantages:

  • Robustness to Outliers: By focusing on the interquartile range and using a robust average, bootstrapping minimizes the influence of outliers. This makes the estimator more reliable, especially when the original data may contain anomalous values.

  • Model-Free Estimation: Bootstrapping doesn't assume a specific underlying distribution, making it a non-parametric method. This flexibility allows it to be applied to various types of data and similarity measures.

  • Confidence Intervals: Bootstrapping can also be used to construct confidence intervals for the similarity measures. These intervals provide insights into the uncertainty associated with the estimate, enhancing interpretability.

  • Enhanced Understanding of Relationships: By assessing the distribution of similarity measures, bootstrapping provides a more comprehensive view of the relationships between vectors. This deeper understanding can be vital in contexts like data analysis, where subtle differences in relationships may have significant implications.

The inclusion of the bootstrapping technique in the Fast Vector Similarity Library adds a layer of robustness and flexibility to the computation of similarity measures. It offers users a way to obtain more reliable and insightful estimates, accommodating various data characteristics and analytical needs. Whether dealing with potential outliers or seeking a more nuanced understanding of vector relationships, bootstrapping provides a valuable tool within the library's comprehensive suite of features.

More Repositories

1

llm_aided_ocr

Enhance Tesseract OCR output for scanned PDFs by applying Large Language Model (LLM) corrections.
Python
1,990
star
2

swiss_army_llama

A FastAPI service for semantic text search using precomputed embeddings and advanced similarity measures, with built-in support for various file types through textract.
Python
919
star
3

bulk_transcribe_youtube_videos_from_playlist

Easily take an entire YouTube playlist and turn it into high quality transcripts using Whisper.
Python
430
star
4

automatic_log_collector_and_analyzer

Replace Splunk in your small company with this one weird trick!
Python
394
star
5

sqlalchemy_data_model_visualizer

Automatically turn your SQLalchemy Data Models into a Nice SVG Diagram
Python
244
star
6

visual_astar_python

Generate Cool-Looking Mazes and Animations Illustrating the A* Pathfinding Algorithm
Python
167
star
7

introduction_to_temporal_logic

An introduction to temporal logic and how it can be used to analyze concurrency
103
star
8

hoeffdings_d_explainer

A Detailed Introduction to My Favorite Statistical Measure, Hoeffding's D
94
star
9

bakery_algorithm

Lamport's Bakery Algorithm Demonstrated in Python
Python
94
star
10

grassmann_article

50
star
11

llm_aided_transcription_improvement

Improve a Transcript from a YouTube Video, Podcast, or other source using LLMs
Python
41
star
12

cloud_benchmarker

Cloud Benchmarker automates performance testing of cloud instances, offering insightful charts and tracking over time.
Python
33
star
13

the_lighthill_debate_on_ai

A Full Transcript of the Lighthill Debate on AI from 1973, with Introductory Remarks
25
star
14

anti_alzheimers_flasher

HTML
20
star
15

youtube_transcript_cleaner

YouTube Transcript Cleaner is a simple web-based application that improves the readability of YouTube transcripts.
HTML
19
star
16

prepareprojectforllmprompt

Transform your code project into a Markdown document optimized for interaction with Language Learning Models like GPT-4, complete with dynamic file selection and token management features.
TypeScript
17
star
17

sassaman_and_dingledine_on_remailers_at_blackhat_2003

A Full Transcript and Slides from the talk "Attacks on Anonymity Systems: The Theory" by Len Sassaman and Roger Dingledine in 2003
13
star
18

paxos_vs_raft

11
star
19

most-influential-github-repo-stars

See the most influential users who have starred or forked a given repo (sorted by earned stars and follower count)
TypeScript
8
star
20

some_thoughts_on_ai_alignment

Some Thoughts on AI Alignment: Using AI to Control AI
4
star
21

textsynth_server_cluster

Python
3
star
22

causal_direction_estimation_from_data

Automated Python script for data-driven causal inference, leveraging machine learning models and statistical error analysis to uncover hidden causal relationships.
Python
2
star
23

llm_aided_legal_discovery_bot

Python
2
star
24

ChatTTS

Forked to make it installable via pip
Jupyter Notebook
2
star
25

ball_fighters

HTML
2
star
26

automatic_cpp_code_analysis_with_gpt

Python
1
star
27

github_stars_curve

JavaScript
1
star
28

advice_for_learning_to_code_and_making_an_app

HTML
1
star
29

github-diff-viewer

JavaScript
1
star