AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-Label Classification
AnnexML is a multi-label classifier designed for extremely large label spaces (10^4 to 10^6 labels). At training time, AnnexML constructs a k-nearest neighbor graph of the label vectors and attempts to reproduce the graph structure in the embedding space. Prediction is performed efficiently by an approximate nearest neighbor search method that explores the learned k-nearest neighbor graph in the embedding space.
For more detail, please see the paper.
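As a rough illustration of the first training step (not the AnnexML implementation itself), the label-space k-nearest neighbor graph can be built from cosine similarities between the samples' label vectors; AnnexML then learns embeddings so that each sample's nearest neighbors in the embedding space match its neighbors in this graph. A minimal Python sketch, assuming a dense 0/1 label matrix:

import numpy as np

def label_knn_graph(Y, k=10):
    # Y: (n_samples, n_labels) 0/1 label matrix; returns, for each sample,
    # the indices of the k most similar samples by cosine of label vectors.
    Yn = Y / np.maximum(np.linalg.norm(Y, axis=1, keepdims=True), 1e-12)
    sim = Yn @ Yn.T                      # pairwise cosine similarities
    np.fill_diagonal(sim, -np.inf)       # exclude self-edges
    return np.argsort(-sim, axis=1)[:, :k]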
Build
A recent compiler supporting C++11 and OpenMP, such as g++, is required.
$ make -C src/ annexml
If your CPU does not support the FMA instruction set, comment out the line CXXFLAG += -DUSEFMA -mfma in src/Makefile before building.
Usage
Data Format
AnnexML takes the multi-label svmlight / libsvm format. The datasets on The Extreme Classification Repository, which have an additional header line, can also be used.
32,50,87 1:1.9 23:0.48 79:0.63
50,51,126 4:0.71 23:0.99 1005:0.08 1018:2.15
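For reference, here is a minimal Python sketch of reading this format (not the repository's own loader); it skips the optional header line used by the Extreme Classification Repository datasets:

def read_multilabel_svmlight(path):
    samples = []  # list of (label list, {feature_id: value}) pairs
    with open(path) as f:
        for i, line in enumerate(f):
            tokens = line.split()
            if not tokens:
                continue
            if i == 0 and ":" not in line:
                continue  # XC Repository header: "num_points num_features num_labels"
            has_labels = ":" not in tokens[0]
            labels = [int(l) for l in tokens[0].split(",")] if has_labels else []
            feats = tokens[1:] if has_labels else tokens
            features = {int(k): float(v) for k, v in (t.split(":") for t in feats)}
            samples.append((labels, features))
    return samples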
Training and prediction
Model parameters and some file paths are specified in a JSON file or as command line arguments.
The settings specified as arguments overwrite those in the JSON file.
Recommended parameters are provided in annexml-example.json.
Examples of training:
$ src/annexml train annexml-example.json
$ src/annexml train annexml-example.json num_thread=32 # use 32 CPU threads for training
$ src/annexml train annexml-example.json cls_type=0 # use k-means clustering for data partitioning
Examples of prediction:
$ src/annexml predict annexml-example.json
$ src/annexml predict annexml-example.json num_learner=4 num_thread=1 # use only 4 learners and 1 CPU thread for prediction
$ src/annexml predict annexml-example.json pred_type=0 # use brute-force cosine calculation
Usage of the evaluation scripts written in Python is as follows:
$ cat annexml-result-example.txt | python scripts/learning-evaluate_predictions.py
#samples=6616
P@1=0.865175
P@2=0.803507
P@3=0.742846
P@4=0.689049
P@5=0.641717
nDCG@1=0.865175
nDCG@2=0.817462
nDCG@3=0.771536
nDCG@4=0.730631
nDCG@5=0.694269
$ cat annexml-result-example.txt | python scripts/learning-evaluate_predictions_propensity_scored.py data/Wiki10/wiki10_train.txt -A 0.55 -B 1.5
#samples=6616
PSP@1=0.119057
PSP@2=0.122856
PSP@3=0.127683
PSP@4=0.131884
PSP@5=0.135722
PSnDCG@1=0.119057
PSnDCG@2=0.121939
PSnDCG@3=0.125388
PSnDCG@4=0.128349
PSnDCG@5=0.130996
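The reported numbers are per-sample metrics averaged over all test samples; the propensity-scored variants (PSP@k, PSnDCG@k) additionally weight each hit by the inverse propensity of its label, estimated from the training data with the A and B parameters. A minimal Python sketch (not scripts/learning-evaluate_predictions.py itself) of how P@k and nDCG@k are typically computed for a single sample:

import math

def precision_at_k(true_labels, ranked_preds, k):
    return sum(1 for l in ranked_preds[:k] if l in true_labels) / k

def ndcg_at_k(true_labels, ranked_preds, k):
    dcg = sum(1.0 / math.log2(i + 2)
              for i, l in enumerate(ranked_preds[:k]) if l in true_labels)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(true_labels))))
    return dcg / ideal if ideal > 0 else 0.0

print(precision_at_k({32, 50, 87}, [50, 7, 87, 3, 12], k=5))  # 2 hits in top 5 -> 0.4
print(ndcg_at_k({32, 50, 87}, [50, 7, 87, 3, 12], k=5))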
Model Parameters and File Paths
emb_size Dimension of the embedding vectors
num_learner Number of learners (or models) for ensemble learning
num_nn Number of (approximate) nearest neighbors used in training and prediction
cls_type Algorithm used for data partitioning
1 : learning procedure that finds a min-cut of the approximate KNNG
0 : k-means clustering
cls_iter Number of epochs for the data partitioning algorithm
emb_iter Number of epochs for learning embeddings
label_normalize Whether label vectors are normalized or not
eta0 Initial value for the AdaGrad learning rate adjustment
lambda L1-regularization parameter for data partitioning (only used if cls_type = 1)
gamma Scaling parameter for cosine similarity (from [-1, 1] to [-gamma, gamma]) in learning embeddings
pred_type Algorithm used for prediction of the k-nearest neighbor classifier
1 : approximate nearest neighbor search method that explores the learned KNNG
0 : brute-force calculation
num_edge Number of direct edges per vertex in the learned KNNG (only used if pred_type = 1)
search_eps Parameter for exploration of the KNNG (only used if pred_type = 1)
num_thread Number of CPU threads used in training and prediction
seed Random seed
verbose Verbosity level (ignored if num_thread > 1)
train_file File path of training data
predict_file File path of prediction data
model_file File path of output model
result_file File path of prediction result
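For illustration only, a settings file might look like the following (assuming a flat JSON object; all values and file names below are placeholders, so consult annexml-example.json for the recommended settings):

{
  "emb_size": 50,
  "num_learner": 15,
  "num_nn": 10,
  "cls_type": 1,
  "cls_iter": 10,
  "emb_iter": 10,
  "label_normalize": 1,
  "eta0": 0.1,
  "lambda": 4,
  "gamma": 10,
  "pred_type": 1,
  "num_edge": 10,
  "search_eps": 0.1,
  "num_thread": 16,
  "seed": 1,
  "verbose": 1,
  "train_file": "data/Wiki10/wiki10_train.txt",
  "predict_file": "data/Wiki10/wiki10_test.txt",
  "model_file": "annexml-model-example.bin",
  "result_file": "annexml-result-example.txt"
}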
License
Copyright (C) 2017 Yahoo Japan Corporation
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
Contributor License Agreement
This project requires contributors to agree to a Contributor License Agreement (CLA).
Note that, only for contributions to the AnnexML repository on GitHub (https://github.com/yahoojapan/AnnexML), contributors shall be deemed to have agreed to the CLA without individual written agreements.
Publications
- Yukihiro Tagami. AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-label Classification. KDD 2017. (KDD Webpage)
Dependencies
AnnexML includes the following software.
- (2-clause BSD license) picojson
Copyright © 2017 Yahoo Japan Corporation All Rights Reserved.