• This repository has been archived on 23/Jan/2019
  • Language
    C++
  • License
    Apache License 2.0

AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-Label Classification

AnnexML is a multi-label classifier designed for extremely large label spaces (10^4 to 10^6 labels). During training, AnnexML constructs a k-nearest neighbor graph of the label vectors and attempts to reproduce that graph structure in the embedding space. Prediction uses an approximate nearest neighbor search method that efficiently explores the learned k-nearest neighbor graph in the embedding space.

For more detail, please see the paper.
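As a rough illustration of the first training step, the sketch below builds an exact k-nearest neighbor graph over label vectors using cosine similarity. It assumes each label vector is binary and is stored as a set of label ids; the function names `cosine` and `knn_graph` are hypothetical and are not part of AnnexML, which uses approximate methods rather than this brute-force loop.

```python
import math

def cosine(a, b):
    """Cosine similarity between two binary label vectors stored as sets of label ids."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def knn_graph(label_sets, k):
    """Return, for each sample, the indices of its k most similar other samples."""
    graph = []
    for i, a in enumerate(label_sets):
        scored = [(cosine(a, b), j) for j, b in enumerate(label_sets) if j != i]
        # Sort by descending similarity, breaking ties by index for determinism.
        scored.sort(key=lambda t: (-t[0], t[1]))
        graph.append([j for _, j in scored[:k]])
    return graph

# Four toy samples; the label ids are illustrative only.
labels = [{32, 50, 87}, {50, 51, 126}, {32, 87}, {126}]
graph = knn_graph(labels, k=2)
print(graph)
```

The brute-force loop is quadratic in the number of samples, which is why an approximate search is needed at the scale AnnexML targets.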

Build

A recent compiler supporting C++11 and OpenMP, such as g++, is required.

$ make -C src/ annexml

If your CPU does not support the FMA instruction set, comment out the line CXXFLAG += -DUSEFMA -mfma in src/Makefile before building.

Usage

Data Format

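AnnexML takes input in multi-label svmlight / libsvm format. Datasets from The Extreme Classification Repository, which have an additional header line, can also be used. Each line lists the comma-separated labels of one sample, followed by its index:value feature pairs: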

32,50,87 1:1.9 23:0.48 79:0.63
50,51,126 4:0.71 23:0.99 1005:0.08 1018:2.15
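A minimal sketch of reading one such line in Python is given here for illustration. The function name `parse_line` is hypothetical, and the sketch assumes integer label ids and feature indices; it does not handle the optional header line.

```python
def parse_line(line):
    """Parse one 'l1,l2,... idx:val idx:val ...' line into (labels, features)."""
    tokens = line.split()
    labels, start = [], 0
    # The first token holds the comma-separated labels unless it is already a feature.
    if tokens and ":" not in tokens[0]:
        labels = [int(t) for t in tokens[0].split(",") if t]
        start = 1
    features = {}
    for tok in tokens[start:]:
        idx, val = tok.split(":", 1)
        features[int(idx)] = float(val)
    return labels, features

labels, features = parse_line("32,50,87 1:1.9 23:0.48 79:0.63")
print(labels, features)
```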

Training and prediction

Model parameters and file paths are specified in a JSON file or as command-line arguments of the form key=value. Settings given as arguments override those in the JSON file. Recommended parameters are in annexml-example.json.
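The override rule can be pictured with the following Python sketch. It is not AnnexML's own parsing code; the function name `load_settings` is hypothetical, and the values in `base` are placeholders rather than the contents of annexml-example.json.

```python
import json

def load_settings(json_text, overrides):
    """Merge key=value overrides on top of settings parsed from JSON text."""
    settings = json.loads(json_text)
    for arg in overrides:
        key, _, raw = arg.partition("=")
        try:
            # Interpret numbers and booleans as JSON scalars when possible.
            settings[key] = json.loads(raw)
        except json.JSONDecodeError:
            # Fall back to a plain string, e.g. for file paths.
            settings[key] = raw
    return settings

# Placeholder values for illustration only.
base = '{"num_thread": 4, "cls_type": 1, "model_file": "model.bin"}'
merged = load_settings(base, ["num_thread=32", "cls_type=0"])
print(merged)
```

Keys that appear only in the JSON text (here `model_file`) are kept unchanged, while keys given on the command line replace the JSON values.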

Examples of training:

$ src/annexml train annexml-example.json
$ src/annexml train annexml-example.json num_thread=32   # use 32 CPU threads for training
$ src/annexml train annexml-example.json cls_type=0   # use k-means clustering for data partitioning

Examples of prediction:

$ src/annexml predict annexml-example.json
$ src/annexml predict annexml-example.json num_learner=4 num_thread=1   # use only 4 learners and 1 CPU thread for prediction
$ src/annexml predict annexml-example.json pred_type=0   # use brute-force cosine calculation

The evaluation scripts, written in Python, are used as follows:

$ cat annexml-result-example.txt | python scripts/learning-evaluate_predictions.py
#samples=6616
P@1=0.865175
P@2=0.803507
P@3=0.742846
P@4=0.689049
P@5=0.641717
nDCG@1=0.865175
nDCG@2=0.817462
nDCG@3=0.771536
nDCG@4=0.730631
nDCG@5=0.694269
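For readers unfamiliar with these metrics, the sketch below computes P@k and nDCG@k for a single sample using the definitions common in the extreme classification literature. It is an illustration under that assumption, not a copy of the evaluation script; the function names and toy data are hypothetical.

```python
import math

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k predicted labels that are true labels."""
    return sum(1 for l in ranked[:k] if l in relevant) / k

def ndcg_at_k(ranked, relevant, k):
    """DCG of the top-k predictions divided by the best achievable DCG."""
    dcg = sum(1.0 / math.log2(i + 2) for i, l in enumerate(ranked[:k]) if l in relevant)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0

ranked = [50, 7, 32, 87, 9]   # predicted labels, best first (toy data)
relevant = {32, 50, 87}       # true labels of the sample (toy data)
p3 = precision_at_k(ranked, relevant, 3)
n3 = ndcg_at_k(ranked, relevant, 3)
print(p3, n3)
```

Under these definitions P@1 and nDCG@1 coincide, which matches the identical first values in the output above. Reported scores are averages over all samples.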

$ cat annexml-result-example.txt | python scripts/learning-evaluate_predictions_propensity_scored.py data/Wiki10/wiki10_train.txt -A 0.55 -B 1.5 
#samples=6616
PSP@1=0.119057
PSP@2=0.122856
PSP@3=0.127683
PSP@4=0.131884
PSP@5=0.135722
PSnDCG@1=0.119057
PSnDCG@2=0.121939
PSnDCG@3=0.125388
PSnDCG@4=0.128349
PSnDCG@5=0.130996
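The -A and -B options correspond to the parameters of the label propensity model commonly used for propensity-scored metrics, in which rare labels receive low propensities and therefore higher weight. The sketch below assumes that standard model and computes an unnormalized PSP@k for one sample; the script itself may differ in details such as normalization, and the function names and toy counts are hypothetical.

```python
import math

def propensity(label_count, num_samples, a=0.55, b=1.5):
    """Propensity of a label that occurs label_count times among num_samples training points."""
    c = (math.log(num_samples) - 1.0) * (b + 1.0) ** a
    return 1.0 / (1.0 + c * (label_count + b) ** (-a))

def psp_at_k(ranked, relevant, counts, num_samples, k, a=0.55, b=1.5):
    """Unnormalized propensity-scored precision at k for one sample."""
    gain = sum(1.0 / propensity(counts.get(l, 0), num_samples, a, b)
               for l in ranked[:k] if l in relevant)
    return gain / k

# Toy label frequencies in a hypothetical training set of 10,000 samples.
counts = {50: 1000, 32: 3}
p_freq = propensity(counts[50], 10000)
p_rare = propensity(counts[32], 10000)
score = psp_at_k([50, 7], {50, 32}, counts, 10000, k=1)
print(p_freq, p_rare, score)
```

A correct prediction of the rare label 32 would contribute 1/p_rare, which is larger than the 1/p_freq contributed by the frequent label 50.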

Model Parameters and File Paths

emb_size          Dimension size of embedding vectors
num_learner       Number of learners (or models) for ensemble learning
num_nn            Number of (approximate) nearest neighbors used in training and prediction
cls_type          Algorithm type used for data partitioning
                  1 : learning procedure which finds min-cut of approximate KNNG
                  0 : k-means clustering
cls_iter          Number of epochs for data partitioning algorithms
emb_iter          Number of epochs for learning embeddings
label_normalize   Whether label vectors are normalized
eta0              Initial value of AdaGrad learning rate adjustment
lambda            L1-regularization parameter of data partitioning (only used if cls_type = 1)
gamma             Scaling parameter for cosine ([-1, 1] to [-gamma, gamma]) in learning embeddings
pred_type         Algorithm type used for prediction of k-nearest neighbor classifier
                  1 : approximate nearest neighbor search method which explores learned KNNG
                  0 : brute-force calculation
num_edge          Number of direct edges per vertex in learned KNNG (only used if pred_type = 1)
search_eps        Parameter for exploration of KNNG (only used if pred_type = 1)
num_thread        Number of CPU threads used in training and prediction
seed              Random seed
verbose           Verbosity level (ignored if num_thread > 1)

train_file        File path of training data
predict_file      File path of prediction data
model_file        File path of output model
result_file       File path of prediction result

License

Copyright (C) 2017 Yahoo Japan Corporation

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this software except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Contributor License Agreement

This project requires contributors to agree to a Contributor License Agreement (CLA).

Note that, only for contributions to the AnnexML repository on GitHub (https://github.com/yahoojapan/AnnexML), contributors shall be deemed to have agreed to the CLA without individual written agreements.

Publications

  • Yukihiro Tagami. AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-label Classification. KDD 2017. (KDD Webpage)

Dependencies

AnnexML includes the following software.

Copyright © 2017 Yahoo Japan Corporation All Rights Reserved.
