
NSG: Navigating Spreading-out Graph For Approximate Nearest Neighbor Search

Introduction

NSG is a graph-based approximate nearest neighbor search (ANNS) algorithm. It provides a flexible and efficient solution for metric-free, large-scale ANNS on dense real vectors, implementing the algorithm of our PVLDB paper, Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graphs. NSG has been integrated into the search engine of Taobao (Alibaba Group) for billion-scale ANNS in e-commerce scenarios.

Performance

Datasets

  • SIFT1M and GIST1M
  • Synthetic datasets: RAND4M and GAUSS5M
    • RAND4M: 4 million 128-dimensional vectors sampled from a uniform distribution over [-1, 1].
    • GAUSS5M: 5 million 128-dimensional vectors sampled from a Gaussian distribution N(0, 3).
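At a toy scale, synthetic data of this kind can be sketched as follows. This is our own illustration, not the generation script used in the paper; we read N(0, 3) as a Gaussian with standard deviation 3, though the notation could also mean variance 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# RAND4M-style: uniform samples over [-1, 1], 128 dimensions (reduced to 1000 vectors).
rand_sample = rng.uniform(-1.0, 1.0, size=(1000, 128)).astype(np.float32)

# GAUSS5M-style: Gaussian samples N(0, 3), 128 dimensions (std-dev reading assumed).
gauss_sample = rng.normal(0.0, 3.0, size=(1000, 128)).astype(np.float32)

print(rand_sample.shape, gauss_sample.shape)  # (1000, 128) (1000, 128)
```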

Compared Algorithms

Graph-based ANNS algorithms:

  • kGraph
  • FANNG : FANNG: Fast Approximate Nearest Neighbour Graphs
  • HNSW (code) : Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs
  • DPG (code) : Approximate Nearest Neighbor Search on High Dimensional Data --- Experiments, Analyses, and Improvement (v1.0)
  • EFANNA (code) : EFANNA: An Extremely Fast Approximate Nearest Neighbor Search Algorithm Based on kNN Graph
  • NSG-naive: a baseline designed for comparison; please refer to our PVLDB paper.
  • NSG: this project; please refer to our PVLDB paper.

Other popular ANNS algorithms:

Results

NSG achieved the best search performance among all the compared algorithms on all four datasets. Among the graph-based algorithms, NSG also has the smallest index size.

NOTE: The performance was tested without parallelism (one query at a time, single-threaded).

[Figure: SIFT1M-100NN-All-Algorithms]

[Figure: SIFT1M-100NN-Graphs-Only]

[Figure: GIST1M-100NN-All-Algorithms]

[Figure: GIST1M-100NN-Graphs-Only]

[Figure: RAND4M-100NN-All-Algorithms]

[Figure: RAND4M-100NN-Graphs-Only]

[Figure: GAUSS5M-100NN-All-Algorithms]

[Figure: GAUSS5M-100NN-Graphs-Only]

[Figure: DEEP1B-100NN]

Building Instruction

Prerequisites

  • GCC 4.9+ with OpenMP
  • CMake 2.8+
  • Boost 1.55+
  • TCMalloc

IMPORTANT NOTE: this code uses AVX2 instructions for fast distance computation, so your machine MUST support AVX2. You can check this with grep avx2 /proc/cpuinfo.

Compile On Ubuntu/Debian

  1. Install Dependencies:
$ sudo apt-get install g++ cmake libboost-dev libgoogle-perftools-dev
  2. Compile NSG:
$ git clone https://github.com/ZJULearning/nsg.git
$ cd nsg/
$ mkdir build/ && cd build/
$ cmake -DCMAKE_BUILD_TYPE=Release ..
$ make -j

(Optional) Docker Usage

  1. Build Docker Image
$ docker build -t nsg .
  2. Run and log into the Docker container:
$ docker run -it --name nsg nsg bash

You can modify the Dockerfile under the project as you need.

Usage

The main interfaces and classes have their respective test code under the tests/ directory.

Building NSG Index

To use NSG for ANNS, an NSG index must be built first. Here are the instructions for building NSG.

Step 1. Build kNN Graph

First, we need to prepare a kNN graph.

We suggest you use our efanna_graph to build this kNN graph. But you can also use any alternatives you like, such as KGraph or faiss.

Step 2. Convert kNN Graph to NSG

Second, we convert the kNN graph into an NSG index.

You can use our demo code to perform this conversion as follows:

$ cd build/tests/
$ ./test_nsg_index DATA_PATH KNNG_PATH L R C NSG_PATH
  • DATA_PATH is the path of the base data in fvecs format.
  • KNNG_PATH is the path of the kNN graph pre-built in Step 1.
  • L controls the quality of the NSG; the larger, the better.
  • R controls the index size of the graph; the best R is related to the intrinsic dimension of the dataset.
  • C controls the maximum candidate pool size during NSG construction.
  • NSG_PATH is the path of the generated NSG index.

Searching via NSG Index

Here are the instructions for searching with an NSG index.

You can use our demo code to perform kNN searching as follows:

$ cd build/tests/
$ ./test_nsg_optimized_search DATA_PATH QUERY_PATH NSG_PATH SEARCH_L SEARCH_K RESULT_PATH
  • DATA_PATH is the path of the base data in fvecs format.
  • QUERY_PATH is the path of the query data in fvecs format.
  • NSG_PATH is the path of the NSG index pre-built in the previous section.
  • SEARCH_L controls the quality of the search results; the larger, the better, but slower. SEARCH_L cannot be smaller than SEARCH_K.
  • SEARCH_K controls the number of nearest neighbors to return.
  • RESULT_PATH is the query results in ivecs format.
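A common way to evaluate the results written to RESULT_PATH is recall@K against a ground-truth ivecs file. A minimal sketch, assuming both files are in ivecs format; the helper names read_ivecs and recall_at_k are ours, and the tiny file here is synthetic:

```python
import os
import tempfile
import numpy as np

def read_ivecs(path):
    """Read an ivecs file: each record is an int32 count d followed by d int32s."""
    raw = np.fromfile(path, dtype=np.int32)
    d = raw[0]
    return raw.reshape(-1, d + 1)[:, 1:]

def recall_at_k(results, ground_truth, k):
    """Average fraction of the true k nearest neighbors that were retrieved."""
    hits = sum(len(set(r[:k]) & set(g[:k])) for r, g in zip(results, ground_truth))
    return hits / (len(results) * k)

# Round-trip a tiny synthetic ivecs file: two records of dimension 3 each.
path = os.path.join(tempfile.gettempdir(), "toy_results.ivecs")
np.array([[3, 1, 2, 5], [3, 4, 8, 6]], dtype=np.int32).tofile(path)
results = read_ivecs(path)

ground_truth = np.array([[1, 2, 9], [4, 7, 11]], dtype=np.int32)
print(recall_at_k(results, ground_truth, 3))  # 0.5
```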

There is another program in the tests/ folder named test_nsg_search. Its parameters are exactly the same as those of test_nsg_optimized_search. test_nsg_search is slower than test_nsg_optimized_search but requires less memory; in situations where memory consumption is critical, use test_nsg_search instead.

NOTE: Only data-type int32 and float32 are supported for now.

HINT: The data_align() function we provide is essential for correctness, because we use SIMD instructions such as AVX and SSE2 to accelerate numerical computation. Use it to ensure each data element (feature vector) is aligned to a multiple of 8 or 16 ints or floats. For example, features of dimension 70 should be extended to dimension 72, with the last 2 dimensions filled with 0 so that the distance computation remains correct. This is what data_align() does.
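The padding described above can be sketched as follows. This is our own helper, not the library's data_align(); it zero-pads the feature dimension up to the next multiple of 8 floats:

```python
import numpy as np

def pad_for_simd(data, multiple=8):
    """Zero-pad the feature dimension up to the next multiple (e.g. 70 -> 72)."""
    n, dim = data.shape
    padded_dim = -(-dim // multiple) * multiple  # ceiling division, scaled back up
    out = np.zeros((n, padded_dim), dtype=data.dtype)
    out[:, :dim] = data
    return out

feats = np.random.rand(10, 70).astype(np.float32)
print(pad_for_simd(feats).shape)  # (10, 72)
```

The trailing zero columns leave Euclidean and inner-product distances unchanged, which is why zero-filling is safe here.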

HINT: Please refer here for the description of the fvecs/ivecs format.
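In the fvecs format, each vector is stored as a little-endian int32 dimension d followed by d float32 components (ivecs is the same with int32 components). A minimal reader/writer sketch with our own helper names:

```python
import os
import tempfile
import numpy as np

def write_fvecs(path, data):
    """fvecs format: each vector is an int32 dimension d followed by d float32s."""
    n, d = data.shape
    with open(path, "wb") as f:
        for row in data.astype(np.float32):
            np.int32(d).tofile(f)
            row.tofile(f)

def read_fvecs(path):
    raw = np.fromfile(path, dtype=np.float32)
    d = raw[:1].view(np.int32)[0]  # reinterpret the first 4 bytes as the dimension
    return raw.reshape(-1, d + 1)[:, 1:]

vecs = np.arange(12, dtype=np.float32).reshape(3, 4)
path = os.path.join(tempfile.gettempdir(), "example.fvecs")
write_fvecs(path, vecs)
print(np.allclose(read_fvecs(path), vecs))  # True
```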

Parameters used in Our Paper

NSG Building

We use the following parameters to get the index in Fig. 6 of our paper.

We use efanna_graph to build the kNN graph.

Step 1. Build kNN Graph

Dataset   K    L    iter   S    R
SIFT1M    200  200  10     10   100
GIST1M    400  400  12     15   100
  • Commands:
$ efanna_graph/tests/test_nndescent sift.fvecs sift_200nn.graph 200 200 10 10 100    # SIFT1M
$ efanna_graph/tests/test_nndescent gist.fvecs gist_400nn.graph 400 400 12 15 100    # GIST1M

Step 2. Convert kNN Graph to NSG

  • Parameters:
Dataset   L    R    C
SIFT1M    40   50   500
GIST1M    60   70   500
  • Commands:
$ nsg/build/tests/test_nsg_index sift.fvecs sift_200nn.graph 40 50 500 sift.nsg        # SIFT1M
$ nsg/build/tests/test_nsg_index gist.fvecs gist_400nn.graph 60 70 500 gist.nsg        # GIST1M

Pre-built kNN Graph and NSG Index

Here we also provide the pre-built kNN graph and NSG index files used in our paper's experiments.

Performance on Taobao's E-commerce Data

Environments:

  • CPU: Xeon E5-2630.

Single Thread Test:

  • Dataset: 10,000,000 128-dimensional vectors.
  • Latency: 1 ms on average over 10,000 queries.

Distributed Search Test:

  • Dataset: 45,000,000 128-dimensional vectors, randomly divided into 12 subsets with one NSG built per subset; the subsets are searched in parallel and the results merged.
  • Latency: 1 ms on average over 10,000 queries.
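The shard-and-merge scheme above can be sketched as follows. This is our own toy illustration, not the production code: each shard returns its top-k candidates as (distance, id) pairs, and a global top-k is taken over their union:

```python
import heapq

def merge_topk(shard_results, k):
    """Merge per-shard top-k lists of (distance, id) pairs into a global top-k."""
    merged = (pair for shard in shard_results for pair in shard)
    return heapq.nsmallest(k, merged)

# Toy example: two shards, each returning its own top-2 candidates.
shard_a = [(0.10, 3), (0.40, 7)]
shard_b = [(0.05, 12), (0.90, 1)]
print(merge_topk([shard_a, shard_b], 2))  # [(0.05, 12), (0.1, 3)]
```

Because each shard's list already contains its k best candidates, the global top-k over the union is exact with respect to the per-shard results.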

Reference

Please cite the following paper if you use NSG in a research paper:

@article{FuNSG17,
  author    = {Cong Fu and Chao Xiang and Changxu Wang and Deng Cai},
  title     = {Fast Approximate Nearest Neighbor Search With The Navigating Spreading-out Graphs},
  journal   = {{PVLDB}},
  volume    = {12},
  number    = {5},
  pages     = {461--474},
  year      = {2019},
  url       = {http://www.vldb.org/pvldb/vol12/p461-fu.pdf},
  doi       = {10.14778/3303753.3303754}
}

TODO

  • Add Docker support
  • Improve compatibility of SIMD-related codes
  • Python wrapper
  • Add travis CI

License

NSG is MIT-licensed.
