  • Stars: 136
  • Rank 267,670 (Top 6 %)
  • Created about 4 years ago
  • Updated almost 3 years ago

Repository Details

Semantic search with embeddings: index anything

awesome-semantic-search

In Semantic search with embeddings, I described how to build semantic search systems (also called neural search). These systems are used more and more as indexing techniques improve and representation learning gets better every year with new deep learning papers. The Medium post explains how to build them, and this list references the interesting resources on the topic so that anyone can quickly start building systems.

  • Tutorials explain in depth how to build semantic search systems
  • Good datasets to build semantic search systems
    • Tensorflow datasets: building search systems only requires images or text, so many tf datasets are interesting in that regard
    • Torchvision datasets: the datasets provided for vision are also interesting for this
  • Pretrained encoders make it possible to quickly build a new system without training
    • Vision+Language
      • Clip encodes image and text in the same space
    • Image
      • Efficientnet b0 is a simple way to encode images
      • Dino is an encoder trained using self supervision which reaches high knn classification performance
      • Face embeddings compute face embeddings
    • Text
      • Labse: a bert text encoder trained for similarity that puts sentences from 109 languages in the same space
    • Misc
      • Jina examples provide examples of how to use pretrained encoders to build search systems
      • Vectorhub: image, text, audio encoders
  • Similarity learning allows you to build new similarity encoders
  • Indexing and approximate knn: indexing makes it possible to create small indices encoding millions of embeddings that can be used to query the data in milliseconds
    • Faiss: many aknn algorithms (ivf, hnsw, flat, gpu, …) in c++ with a python interface
    • Autofaiss to use faiss easily
    • Nmslib fast implementation of hnsw
    • Annoy: an aknn algorithm by spotify
    • Scann: an aknn algorithm faster than hnsw, by google
    • Catalyzer training the quantizer with backpropagation
    • hora approximate knn implemented in rust
  • Search pipelines allow fast serving and customization of how the indices are queried
    • Milvus end to end similarity engine, on top of faiss and hnswlib
    • Jina flexible end to end similarity engine
    • Haystack question answering on text pipeline
  • Companies: many companies are being built around semantic search systems
    • Jina is building flexible pipelines to encode and search with embeddings
    • Weaviate is building a cloud-native vector search engine
    • Pinecone a startup building databases indexing embeddings
    • Vector ai is building an encoder hub
    • Milvus builds an end to end open source semantic search system
    • FeatureForm's embeddinghub combining DB and KNN
    • vespa knn-based managed retrieval engine
    • Many other companies are using these systems and releasing open tools along the way, and the list would be too long to include here (for example facebook with faiss and self supervision, google with scann and thousands of papers, microsoft with sptag, spotify with annoy, criteo with rsvd, deepr, autofaiss, …)
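
The encode, index, and query flow that the resources above cover can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the API of any listed tool: the `FlatIndex` class, the `normalize` helper, and the three-dimensional vectors are all hypothetical stand-ins. A real system would obtain embeddings from a pretrained encoder (such as Clip) and would likely replace the exact scan with an approximate knn index (such as Faiss).

```python
import math


def normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


class FlatIndex:
    """Hypothetical exact (brute-force) index; approximate indices trade accuracy for speed."""

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, item_id, embedding):
        # Store the normalized embedding alongside its identifier.
        self.ids.append(item_id)
        self.vectors.append(normalize(embedding))

    def search(self, query, k=3):
        # Score every stored vector against the query, then keep the k best.
        q = normalize(query)
        scored = [
            (sum(a * b for a, b in zip(q, v)), item_id)
            for item_id, v in zip(self.ids, self.vectors)
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(item_id, score) for score, item_id in scored[:k]]


# Toy hand-written embeddings (assumption: a real encoder would produce these).
index = FlatIndex()
index.add("cat photo", [0.9, 0.1, 0.0])
index.add("dog photo", [0.8, 0.3, 0.1])
index.add("car photo", [0.0, 0.2, 0.9])

# Query with a vector close to the "cat" and "dog" embeddings.
results = index.search([1.0, 0.2, 0.0], k=2)
for item_id, score in results:
    print(item_id, round(score, 3))
```

The exact scan compares the query against every stored vector, so its cost grows linearly with the collection size; this is the cost that the approximate knn libraries in the list are designed to reduce.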

More Repositories

1

img2dataset

Easily turn large sets of image urls to an image dataset. Can download, resize and package 100M urls in 20h on one machine.
Python
3,610
star
2

clip-retrieval

Easily compute clip embeddings and build a clip retrieval system with them
Jupyter Notebook
2,361
star
3

cc2dataset

Easily convert common crawl to a dataset of caption and document. Image/text Audio/text Video/text, ...
Python
303
star
4

laion-prepro

Get hundreds of millions of image+url pairs from the crawling at home dataset and preprocess them
Python
202
star
5

image_embeddings

Using efficientnet to provide embeddings for retrieval
Jupyter Notebook
152
star
6

MinecraftChat

Minecraft web based chat client
JavaScript
101
star
7

embedding-reader

Efficiently read embeddings in streaming from any filesystem
Python
94
star
8

rbot

Bot made with mineflayer which can do tasks
JavaScript
81
star
9

gpu-tester

gpu tester detects broken and slow gpus in a cluster
Python
64
star
10

dalle-service

Dalle service
JavaScript
50
star
11

any2dataset

Turn any collection of files into a dataset
Python
42
star
12

python-template

Simple python template
Python
40
star
13

audio2dataset

Easily turn large sets of audio urls to an audio dataset.
Python
20
star
14

sshd_android

How to access your android phone from anywhere using ssh
14
star
15

kaggle-fashion-dalle

Kaggle fashion dataset in dalle format
Jupyter Notebook
13
star
16

static-ondisk-kv

Simple and fast implementation of a static on disk key value store, in python
Python
9
star
17

all-clip

Load any clip model with a standardized interface
Python
9
star
18

slurm-tracking-bot

Simple slurm tracking bot to check usage
Python
8
star
19

web-minecraft-crafter

A web interface to minecraft crafter
JavaScript
8
star
20

minecraft-schematics-dataset

Minecraft schematics dataset
Jupyter Notebook
8
star
21

word_knn

Quickly find closest words using an efficient knn and word embeddings
Python
6
star
22

parse-wikitext

A simple wikitext parser in node.js
JavaScript
6
star
23

node-fernflower

Simple fernflower java decompiler wrapper
JavaScript
5
star
24

wct-datatables-net

Datatables.net as a webcomponent
JavaScript
5
star
25

node-corenlp-client

Simple corenlp client to the corenlp http server using request-promise
JavaScript
5
star
26

node-minecraft-proxies

Create minecraft proxies in node.js
JavaScript
5
star
27

flying-squid-schematic

Flying-squid plugin providing /listSchemas and /loadSchema commands.
JavaScript
4
star
28

minecraft-crafter

Tells you how to get any item by crafting in minecraft
JavaScript
4
star
29

flying-squid-irc

Make a bridge between flying-squid and an IRC channel.
JavaScript
4
star
30

TvSeriesOrganizer

Application targeting desktop and mobile to organize your tv series
QML
4
star
31

tensorflow_captcha_solver

Captcha solver based on https://medium.com/@ageitgey/how-to-break-a-captcha-system-in-15-minutes-with-machine-learning-dbebb035a710
Python
4
star
32

minecraft-schematic-crawler

Automatic minecraft schematic crawler for bots and ML
JavaScript
4
star
33

PersonalKnowledgeBase

Storing data about people.
4
star
34

adjective-animal

Generate an adjective-animal name !
JavaScript
4
star
35

auto-squid

Auto update and start flying-squid
Shell
4
star
36

rom1504.github.io

Personal website
3
star
37

minespy

Spy everybody with your minecraft proxy
JavaScript
3
star
38

ideas

Ideas
3
star
39

npm-safeguard

Download the most popular npm packages and check if they have accidentally published dot files
JavaScript
3
star
40

FaceRecognition

A program made using perl, bash, c++, opencv and libsvm which makes it possible to automatically recognize faces.
Perl
3
star
41

imlb

Instant Messaging Logs Base : store and make available all your instant messages
3
star
42

schematic-to-world

Load a minecraft schematic into prismarine world
JavaScript
3
star
43

distributed-shuffle

A simple implementation of distributed shuffle, intended for learning
Python
2
star
44

AutoTathamet

Create Diablo2 bots with a powerful, stable, and high level JavaScript API.
JavaScript
2
star
45

minecraft-task-graph

Define a graph of tasks for minecraft
2
star
46

deepfashion_to_tfrecords

Convert deepfashion to tfrecords to learn multimodal models
Jupyter Notebook
2
star
47

rom1504

Profile readme
2
star
48

voxel-prismarine-world

An experimental prismarine-world visualizer using voxeljs.
JavaScript
2
star
49

mcpe-protocol-extractor

Extract MCPE protocol from pocketmine
JavaScript
2
star
50

BinaryTreeExample

This is an example for the GenericBinaryTree lib
C++
2
star
51

getSubtitle

Allows you to easily get English tv show subtitles from addic7ed from the command line.
Perl
2
star
52

fromconfig-mlflow

A fromconfig Launcher for MlFlow
Python
1
star
53

SignalList

A list container built around QList that emits signals when add, delete, ... methods are called.
C++
1
star
54

CorganoBot

@Corgano's minecraft bot
JavaScript
1
star
55

testing_repo

Just tests
1
star
56

autofaiss_rom1504

Automatically create Faiss knn indices with the most optimal similarity search parameters.
Python
1
star
57

MasonJar

NodeJS Minecraft implementation used on 8BitBlocks 2.0
JavaScript
1
star
58

ReVerbHttp

A simple http server to query ReVerb
Java
1
star
59

ChineseNumber

A chinese number converter in c++/Qt with unit tests
C++
1
star
60

pascal_interpreter

Make pascal graph call, pascal interpreter and compiler to c
OpenEdge ABL
1
star
61

DBpediaPerl

A very simple perl module which allows you to query the DBpedia sparql endpoint.
Perl
1
star
62

getQuotesSmooth

get quotes from smoothirc.net
JavaScript
1
star
63

node-facebook-import

Import facebook logs into a database.
JavaScript
1
star
64

rom1504.fr

My site
HTML
1
star
65

rcontact

Contact manager
C++
1
star
66

BotIrssi

An irc bot offering games and other features, irssi plugin
Perl
1
star
67

FaceDetect

A program that uses opencv, bash, perl and c++ to detect faces in pictures.
Perl
1
star
68

JsonConv

Convert Json to xml and sql
TeX
1
star
69

GenericBinaryTree

This is a generic binary tree implementation and a viewer for these trees for Qt
C++
1
star
70

moteurPhysique

Management of several entities and their movement. A unit can also be built from the building.
C++
1
star
71

keras-square-function-estimator

A simple example on estimating the square function in keras
Python
1
star
72

RelExHttp

A simple http server to query RelEx
Java
1
star
73

faiss-java

Maven package for faiss
Java
1
star
74

FaceRecognitionInterface

Software that handles the whole process of tagging people on pictures.
C++
1
star
75

my-github-backups

Backup of my github projects
1
star
76

SimpleEditor

A simple editor made with Qt
C++
1
star
77

distributed-translator

Translate millions of captions to hundreds of languages efficiently
Python
1
star
78

TvSeriesOrganizerPluginInterface

Allows plugins to interact with an episode
C++
1
star
79

GeneralQmlItems

Some useful general Qml Items
IDL
1
star
80

ngengine

A 2D/3D Game Engine (C++, OpenGL, Glm).
C++
1
star
81

TvSeriesAPI

A c++ Qt API providing series data from thetvdb and trakt
C++
1
star
82

FreebasePerl

A very simple perl module which allows you to query the freebase database.
Perl
1
star
83

client_irc

Client irc built with Qt (inspired by xchat)
C++
1
star
84

node-voxel-worldgen

A voxel world generator written in Rust, with bindings for JavaScript
Rust
1
star
85

node-raknet

UDP network library that follows the RakNet protocol for Node.js
JavaScript
1
star
86

freehex

A hex game
JavaScript
1
star
87

ecosysteme

A kind of ecosystem simulation coded in c++ with SDL
C++
1
star