AutoFaiss


Automatically create Faiss knn indices with the most optimal similarity search parameters.

It selects the best indexing parameters to achieve the highest recalls given memory and query speed constraints.

Docs, posts, and notebooks

Using efficient faiss indices, binary search, and heuristics, Autofaiss makes it possible to automatically build, in about 3 hours and with only 15 GB of memory, a large KNN index (200 million vectors, 1 TB) with query latency around 10 ms.

Get started by running this colab notebook, then check the full documentation.
Get some insights on the automatic index selection function with this colab notebook.

Then you can check our multimodal search example (using OpenAI Clip model).

Read the medium post to learn more about it!

Installation

To install, run pip install autofaiss

It's probably best to create a virtual env:

python -m venv .venv/autofaiss_env
source .venv/autofaiss_env/bin/activate
pip install -U pip
pip install autofaiss

Using autofaiss in python

If you want to use autofaiss directly from python, check the API documentation and the examples.

In particular, you can use autofaiss with in-memory or on-disk embedding collections:

Using in-memory numpy arrays

If you only have a few embeddings, you can use autofaiss with in-memory numpy arrays:

from autofaiss import build_index
import numpy as np

# 100 random embeddings of dimension 512; faiss expects float32 vectors
embeddings = np.float32(np.random.rand(100, 512))
index, index_infos = build_index(embeddings, save_on_disk=False)

# Search for the single nearest neighbor of one query vector
query = np.float32(np.random.rand(1, 512))
_, I = index.search(query, 1)
print(I)
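Since the default similarity is inner product, cosine similarity can be obtained by L2-normalizing the embeddings before building the index. A minimal sketch of that preprocessing step in plain numpy (this helper is illustrative, not part of the autofaiss API):

```python
import numpy as np

def l2_normalize(embeddings: np.ndarray) -> np.ndarray:
    """Scale each row to unit L2 norm, so inner product equals cosine similarity."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # avoid division by zero for all-zero rows
    return embeddings / norms

embeddings = np.float32(np.random.rand(100, 512))
normalized = l2_normalize(embeddings)
# After normalization every vector has unit norm:
print(np.allclose(np.linalg.norm(normalized, axis=1), 1.0))  # True
```

The normalized array can then be passed to build_index as in the example above.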

Using numpy arrays saved as .npy files

If you have many embedding files, it is preferable to save them on disk as .npy files and then use autofaiss like this:

from autofaiss import build_index

build_index(embeddings="embeddings", index_path="my_index_folder/knn.index",
            index_infos_path="my_index_folder/index_infos.json", max_index_memory_usage="4G",
            current_memory_available="4G")

Memory-mapped indices

Faiss makes it possible to use memory-mapped indices. This is useful when you can tolerate slower search times (>50 ms) and want to reduce the memory footprint to a minimum.

We provide the should_be_memory_mappable boolean in the build_index function to generate only memory-mapped indices. Note: only IVF indices can be memory-mapped in faiss, so the output index will be an IVF index.

To load an index in memory mapping mode, use the following code:

import faiss
index = faiss.read_index("my_index_folder/knn.index", faiss.IO_FLAG_MMAP | faiss.IO_FLAG_READ_ONLY)

You can have a look at the examples to see how to use it.

Technical note: you can create a direct map on IVF indices with index.make_direct_map() (or directly from the build_index function by passing the make_direct_map boolean). This greatly speeds up the .reconstruct() method, which returns the value of a vector given its rank. However, the mapping is stored in RAM, so we advise you to create your own direct map in a memory-mapped numpy array and then call .reconstruct_from_offset() with your custom direct_map.
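The custom direct map suggested above could look like the following sketch: a memory-mapped numpy array where row i holds the (list_no, offset) pair for vector id i in an IVF index. Filling the array from a real index's inverted lists is omitted, and the file path and example values are illustrative only:

```python
import numpy as np
import os
import tempfile

# Hypothetical layout: row i holds (list_no, offset) for vector id i of an IVF index.
n_vectors = 1000
path = os.path.join(tempfile.mkdtemp(), "direct_map.npy")

# Create the memory-mapped direct map on disk (int64, one (list_no, offset) pair per id).
direct_map = np.lib.format.open_memmap(path, mode="w+", dtype=np.int64, shape=(n_vectors, 2))
direct_map[:] = 0  # in practice, filled by iterating the index's inverted lists

# Example entry: pretend vector 42 lives in inverted list 7 at offset 3.
direct_map[42] = (7, 3)
direct_map.flush()

# Later, reopen in read-only memory-mapped mode; RAM usage stays minimal.
reloaded = np.load(path, mmap_mode="r")
list_no, offset = reloaded[42]
# index.reconstruct_from_offset(int(list_no), int(offset)) would then return the vector.
```

This keeps the id-to-location mapping on disk instead of RAM, at the cost of one extra lookup per reconstruction.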

Using autofaiss with pyspark

Autofaiss allows you to build indices with Spark for the following two use cases:

  • To build a big index in a distributed way
  • To build one index per partition of a partitioned embeddings dataset, in parallel and in a distributed way

Prerequisites:

  1. Install pyspark: pip install pyspark.
  2. Prepare your embeddings files (partitioned or not).
  3. Create a Spark session before calling autofaiss. If no Spark session exists, a default session will be created with a minimal configuration.

Creating a big index in a distributed way

See distributed_autofaiss.md for a complete guide.

It is possible to generate an index that would require more memory than what's available. To do so, you can control the number of index splits that will compose your index with nb_indices_to_keep. For example, if nb_indices_to_keep is 10 and index_path is knn.index, the final index will be decomposed into 10 smaller indexes:

  • knn.index01
  • knn.index02
  • knn.index03
  • ...
  • knn.index10

A concrete example shows how to produce N indices and how to use them.
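When the index is split into several smaller indices, one way to query it is to search each shard independently and merge the per-shard results into a global top-k. A sketch of that merge step in plain numpy (the shard searches themselves are omitted; this assumes inner-product scores, where higher is better, and globally unique ids across shards):

```python
import numpy as np

def merge_topk(shard_distances, shard_indices, k):
    """Merge per-shard faiss-style search results into one global top-k.

    shard_distances / shard_indices: lists of (n_queries, k) arrays, one per shard.
    Assumes inner-product scores (higher is better) and globally unique ids.
    """
    all_d = np.concatenate(shard_distances, axis=1)
    all_i = np.concatenate(shard_indices, axis=1)
    order = np.argsort(-all_d, axis=1)[:, :k]  # best k columns per query
    rows = np.arange(all_d.shape[0])[:, None]
    return all_d[rows, order], all_i[rows, order]

# Two fake shards, one query, k=2 results each:
d1, i1 = np.array([[0.9, 0.5]]), np.array([[10, 11]])
d2, i2 = np.array([[0.8, 0.7]]), np.array([[20, 21]])
distances, indices = merge_topk([d1, d2], [i1, i2], k=2)
print(indices)  # [[10 20]]
```

For L2 distances (lower is better), the sort would drop the negation.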

Creating partitioned indexes

Given a partitioned dataset of embeddings, it is possible to create one index per partition by calling the method build_partitioned_indexes.

See this example that shows how to create partitioned indexes.

Using the command line

Create embeddings

import os
import numpy as np
embeddings = np.random.rand(1000, 100)
os.mkdir("embeddings")
np.save("embeddings/part1.npy", embeddings)
os.mkdir("my_index_folder")

Generate a KNN index

autofaiss build_index --embeddings="embeddings" --index_path="my_index_folder/knn.index" --index_infos_path="my_index_folder/index_infos.json" --metric_type="ip"

Try the index

import faiss
import glob
import numpy as np

my_index = faiss.read_index(glob.glob("my_index_folder/*.index")[0])

query_vector = np.float32(np.random.rand(1, 100))
k = 5
distances, indices = my_index.search(query_vector, k)

print(list(zip(distances[0], indices[0])))

How are indices selected?

To better understand why indices are selected and what their characteristics are, check the index selection demo.

Command quick overview

Quick description of the autofaiss build_index command:

embeddings -> Source path of the embeddings in numpy.
index_path -> Destination path of the created index.
index_infos_path -> Destination path of the index infos.
save_on_disk -> Whether to save the index on disk.
metric_type -> Similarity distance for the queries.

index_key -> (optional) Description of the index to build.
index_param -> (optional) Hyperparameters of the index.
current_memory_available -> (optional) Amount of memory available on the machine.
use_gpu -> (optional) Whether to use GPU or not (not tested).

Command details

The autofaiss build_index command takes the following parameters:

Flag (default): Description

--embeddings (required): Directory (or list of directories) containing your .npy embedding files. If there are several files, they are read in lexicographical order. This can be a local path or a path on another filesystem, e.g. hdfs://root/... or s3://...
--index_path (required): Destination path of the faiss index on the local machine.
--index_infos_path (required): Destination path of the faiss index infos on the local machine.
--save_on_disk (required): Whether to save the index on disk.
--file_format ("npy"): File format of the files in embeddings. Can be either npy for numpy matrix files or parquet for parquet serialized tables.
--embedding_column_name ("embeddings"): Only necessary when file_format=parquet. Name of the column containing the embeddings (one vector per row).
--id_columns (None): Can only be used when file_format=parquet. Names of the columns containing the ids of the vectors; separate files will be generated to map these ids to indices in the KNN index.
--ids_path (None): Only useful when id_columns is not None and file_format=parquet. Path (on any filesystem) where the id-to-vector-index mapping files will be stored in parquet format.
--metric_type ("ip"): (Optional) Similarity function used for queries: "ip" for inner product, "l2" for euclidean distance.
--max_index_memory_usage ("32GB"): (Optional) Maximum size in GB of the created index; this bound is strict.
--current_memory_available ("32GB"): (Optional) Memory available (in GB) on the machine creating the index; having more memory helps because it reduces swapping between RAM and disk.
--max_index_query_time_ms (10): (Optional) Bound on the query time for KNN search; this bound is approximate.
--min_nearest_neighbors_to_retrieve (20): (Optional) Minimum number of nearest neighbors to retrieve when querying the index. Used only during the index hyperparameter fine-tuning step; it is not taken into account when selecting the indexing algorithm. This parameter takes priority over the max_index_query_time_ms constraint.
--index_key (None): (Optional) If present, the faiss index will be built using this description string in the index_factory; more detail in the faiss documentation.
--index_param (None): (Optional) If present, the faiss index will be configured using this description string of hyperparameters; more detail in the faiss documentation.
--use_gpu (False): (Optional) Experimental. GPU training can be faster, but this feature is not tested so far.
--nb_cores (None): (Optional) Number of cores to use; by default all cores are used.
--make_direct_map (False): (Optional) Create a direct map allowing reconstruction of embeddings. Only needed for IVF indices. Note that it may increase RAM usage (approximately 8GB for 1 billion embeddings).
--should_be_memory_mappable (False): (Optional) Force the index to be selected among indices that have an on-disk memory-mapping implementation.
--distributed (None): (Optional) If "pyspark", create the index using pyspark. Otherwise, the index is created on your local machine.
--temporary_indices_folder ("hdfs://root/tmp/distributed_autofaiss_indices"): (Optional) Folder in which to save the temporary small indices; only used when distributed="pyspark".
--verbose (20): (Optional) Verbosity of logging output: DEBUG=10, INFO=20, WARN=30, ERROR=40, CRITICAL=50.
--nb_indices_to_keep (1): (Optional) Maximum number of indices to keep when distributed is "pyspark".

Install from source

First, create a virtual env and install dependencies:

python3 -m venv .env
source .env/bin/activate
make install

To run a specific test: python -m pytest -x -s -v tests -k "test_get_optimal_hyperparameters"
