  • Stars: 117
  • Rank: 300,025 (Top 6 %)
  • Language: Python
  • License: MIT License
  • Created over 1 year ago
  • Updated over 1 year ago

Repository Details

Code for "Active Learning from the Web" (WWW 2023)

Active Learning from the Web (WWW 2023)

We propose Seafaring, a method for acquiring useful data for training machine learning models by regarding the myriad data on the Web as a huge pool of active learning.

Paper: https://arxiv.org/abs/2210.08205
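
For readers new to the setting, the description above builds on pool-based active learning. The following is a minimal, generic sketch of that loop with uncertainty sampling, not the code in methods.py. The logistic-regression model, the synthetic data, and all function names are assumptions made for illustration; only the names n_rounds and budget_per_round mirror the --nround and --budget_per_round options of main.py.

```python
import numpy as np


def train_logreg(X, y, n_steps=200, lr=0.5):
    """Fit a tiny logistic-regression model with plain gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w


def uncertainty(w, X):
    """Score each item; a higher score means a prediction closer to 0.5."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return -np.abs(p - 0.5)


def active_learning(pool_X, oracle, init_idx, n_rounds, budget_per_round):
    """Generic loop: train, score the whole pool, label the most uncertain items."""
    labelled = list(init_idx)
    labels = [oracle(i) for i in labelled]
    for _ in range(n_rounds):
        w = train_logreg(pool_X[labelled], np.array(labels, dtype=float))
        scores = uncertainty(w, pool_X)
        scores[labelled] = -np.inf  # never query an already labelled item
        picked = np.argsort(scores)[-budget_per_round:]
        for i in picked:
            labelled.append(int(i))
            labels.append(oracle(int(i)))
    return labelled, labels


if __name__ == "__main__":
    # Synthetic pool (an assumption): 500 points with a hidden linear labeller.
    rng = np.random.default_rng(0)
    pool_X = rng.normal(size=(500, 5))
    true_w = rng.normal(size=5)
    labelled, labels = active_learning(
        pool_X, lambda i: int(pool_X[i] @ true_w > 0),
        init_idx=[0, 1], n_rounds=10, budget_per_round=1,
    )
    print(len(labelled), "items labelled")
```

Note that this sketch scores every item in the pool in each round. That exhaustive step is what becomes impractical when the pool is the Web, which is where the repository relies on Tiara as its backbone.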

💿 Dependency

Please install

  • wget and unzip, e.g., by sudo apt install wget unzip,
  • PyTorch from the official website, and
  • other dependencies by pip install -r requirements.txt.
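
Before starting the long download step, it can help to confirm that the requirements listed above are actually present. The snippet below is a small sketch, not part of the repository: check_dependencies is a hypothetical helper, and it only checks that the tools are on the PATH and that the torch module can be found, not that their versions match requirements.txt.

```python
import importlib.util
import shutil


def check_dependencies(tools=("wget", "unzip"), modules=("torch",)):
    """Return a dict mapping each requirement to True if it was found."""
    status = {}
    for tool in tools:
        # shutil.which returns None when the executable is not on the PATH.
        status[tool] = shutil.which(tool) is not None
    for module in modules:
        # find_spec returns None when a top-level module is not installed.
        status[module] = importlib.util.find_spec(module) is not None
    return status


if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'ok' if found else 'MISSING'}")
```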

📂 Files

  • download_and_preprocess.sh downloads and preprocesses the Open Images dataset.
  • main.py runs Seafaring and baseline methods.
  • methods.py implements Seafaring and baseline methods.
  • tiara.py implements Tiara, i.e., the backbone algorithm of Seafaring.
  • utils.py implements miscellaneous functions, e.g., the word embedding loader.

🗃️ Download and Preprocess Datasets

$ bash ./download_and_preprocess.sh

Note that this step may take anywhere from several hours to days.

🧪 Evaluation

Try the Open Images dataset with:

$ python main.py --device cuda --initdata 1 --nround 100 --budget_per_round 1 --method Random --env OpenImage --tiara_budget 1000 --poslabels Carnivore --seed 0
$ python main.py --device cuda --initdata 1 --nround 100 --budget_per_round 1 --method SmallExact --env OpenImage --tiara_budget 1000 --poslabels Carnivore --seed 0
$ python main.py --device cuda --initdata 1 --nround 100 --budget_per_round 1 --method Seafaring --env OpenImage --tiara_budget 1000 --poslabels Carnivore --seed 
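
To compare the three methods over several seeds, the commands above can be generated rather than typed by hand. This is a sketch under assumptions: build_command and run_sweep are hypothetical helpers, the seed values 1 and 2 are illustrative, and every flag is copied from the commands above. It prints the commands by default and only launches them when dry_run is False.

```python
import subprocess
import sys


def build_command(method, seed, env="OpenImage", poslabels=("Carnivore",)):
    """Assemble a main.py invocation from the flags used in the commands above."""
    return [
        sys.executable, "main.py",
        "--device", "cuda",
        "--initdata", "1",
        "--nround", "100",
        "--budget_per_round", "1",
        "--method", method,
        "--env", env,
        "--tiara_budget", "1000",
        "--poslabels", *poslabels,
        "--seed", str(seed),
    ]


def run_sweep(methods=("Random", "SmallExact", "Seafaring"), seeds=(0, 1, 2), dry_run=True):
    """Print (or run, if dry_run is False) one command per method and seed."""
    commands = [build_command(m, s) for m in methods for s in seeds]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop on the first failing run
    return commands


if __name__ == "__main__":
    run_sweep()
```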

Try the Flickr environment with:

$ python main.py --device cuda --initdata 1 --nround 100 --budget_per_round 1 --method SmallExact --env Flickr --tiara_budget 100 --apikey [YOUR_API_KEY] --initialtags flickr_objects/initial_tags.txt --user 0 --threshold 0.78
$ python main.py --device cuda --initdata 1 --nround 100 --budget_per_round 1 --method Seafaring --env Flickr --tiara_budget 100 --apikey [YOUR_API_KEY] --initialtags flickr_objects/initial_tags.txt --user 0 --threshold 0.78

The results are saved in the results directory.

Please refer to the help command for further options.

$ python main.py -h
usage: main.py [-h] [--seed SEED] [--method {Seafaring,Random,SmallExact}]
               [--env {OpenImage,Flickr}] [--apikey APIKEY]
               [--tiara_budget TIARA_BUDGET]
               [--budget_per_round BUDGET_PER_ROUND] [--initdata INITDATA]
               [--testdata TESTDATA] [--nround NROUND] [--nepoch NEPOCH]
               [--alpha ALPHA] [--threshold THRESHOLD] [--batchsize BATCHSIZE]
               [--poolsize POOLSIZE] [--device DEVICE]
               [--poslabels POSLABELS [POSLABELS ...]] [--user USER]
               [--initialtags INITIALTAGS] [--resdir RESDIR]

optional arguments:
  -h, --help            show this help message and exit
  --seed SEED
  --method {Seafaring,Random,SmallExact}
  --env {OpenImage,Flickr}
  --apikey APIKEY       API key of Flickr. Valid only for Flickr env.
  --tiara_budget TIARA_BUDGET
  --budget_per_round BUDGET_PER_ROUND
  --initdata INITDATA   Number of the initial labelled data.
  --testdata TESTDATA   Size of the test dataset.
  --nround NROUND       Number of rounds of active learning.
  --nepoch NEPOCH       Number of epochs for training the target model.
  --alpha ALPHA         The alpha parameter of Tiara.
  --threshold THRESHOLD
                        Threshold of Positive data. Valid only for Flickr
                        env.
  --batchsize BATCHSIZE
  --poolsize POOLSIZE   Size of the poolsize for SmallExact method
  --device DEVICE
  --poslabels POSLABELS [POSLABELS ...]
                        List of positive labels. Valid only for OpenImage env.
  --user USER           Id of the target virtual user, i.e., category. Valid
                        only for Flickr env. See also create_virtual_users.py.
  --initialtags INITIALTAGS
                        Path to the tag file.
  --resdir RESDIR

Flickr API

The Flickr experiments require a Flickr API key. Please obtain a key from the official Flickr website.
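
One way to avoid pasting the key into shell history is to read it from an environment variable and pass it to --apikey programmatically. The sketch below is an assumption-laden illustration: the variable name FLICKR_API_KEY and the helper flickr_args are hypothetical, and main.py itself is only documented to accept the key through --apikey.

```python
import os


def flickr_args(env_var="FLICKR_API_KEY"):
    """Return the --apikey arguments, reading the key from an environment variable."""
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set {env_var} to your Flickr API key first.")
    return ["--apikey", key]
```

The returned list can be appended to the argument list of a main.py invocation, for example one launched with subprocess.run.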

Results

Seafaring outperforms the baseline methods on the OpenImage benchmark.

Seafaring outperforms the traditional active learning approach in the Flickr environment, which contains more than 10 billion images.

Please refer to the paper for more details.

🖋️ Citation

@inproceedings{sato2023active,
  author    = {Ryoma Sato},
  title     = {Active Learning from the Web},
  booktitle = {Proceedings of the Web Conference 2023, {WWW}},
  year      = {2023},
}

More Repositories

1

clear

A fully user-side image search engine. Accepted to CIKM 2022 demo track.
JavaScript
250
star
2

wordtour

Code for "Word Tour: One-dimensional Word Embeddings via the Traveling Salesman Problem" (NAACL 2022)
Python
92
star
3

otbook

Support page for the book "最適輸送の理論とアルゴリズム" (Theory and Algorithms of Optimal Transport).
Jupyter Notebook
89
star
4

gnnbook

Support site for the book "グラフニューラルネットワーク" (Graph Neural Networks).
Jupyter Notebook
49
star
5

private-recsys

Code for "Private Recommender Systems: How Can Users Build Their Own Fair Recommender Systems without Log Data?" (SDM 2022)
Python
47
star
6

reeval-wmd

Code for "Re-evaluating Word Mover’s Distance" (ICML 2022)
Python
38
star
7

chainer-PGGAN

Progressive Growing of GANs implemented with chainer
Python
33
star
8

ConvLSTM

Convolutional LSTM implemented with chainer
Python
31
star
9

laf

Code for "Training-free Graph Neural Networks and the Power of Labels as Features" (TMLR 2024)
Python
29
star
10

chainer-ETTTS

This is an implementation of "Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention" with chainer.
Python
28
star
11

speedbook

Support site for the book "深層ニューラルネットワークの高速化" (Accelerating Deep Neural Networks).
Jupyter Notebook
20
star
12

gnnrecover

Code for "Graph Neural Networks can Recover the Hidden Features Solely from the Graph Structure" (ICML 2023)
Python
19
star
13

le3hw

Surpassing Intel
C++
15
star
14

treegkr

Code for "Fast Unbalanced Optimal Transport on a Tree" (NeurIPS 2020)
C++
12
star
15

fape

Code for "Enumerating Fair Packages for Group Recommendations" (WSDM 2022)
Python
8
star
16

tiara

Code for "Retrieving Black-box Optimal Images from External Databases" (WSDM 2022)
Python
6
star
17

consul

Code for "Towards Principled User-side Recommender Systems" (CIKM 2022)
Python
5
star
18

mugenyuichan

C++
4
star
19

HiSampler

HiSampler: Learning to Sample Hard Instances for Graph Algorithms
C++
4
star
20

twitter_illust_collector

This script collects 2d illust from twitter lists and posts them to slack.
Python
3
star
21

kirara-slack

Notifies Slack of the release dates of the Manga Time Kirara family of magazines.
Python
3
star
22

twinpaper

Code for "Twin Papers: A Simple Framework of Causal Inference for Citations via Coupling" (CIKM 2022)
Python
2
star
23

anchor-energy

Fast and Robust Comparison of Probability Measures in Heterogeneous Spaces
2
star
24

ex4-ARMemulator

ARM emulator for ex4
Python
1
star
25

le2sw

Experiment 2: software experiment
Java
1
star
26

poincare

This project aims to recommend publication venues to scientific papers.
Python
1
star
27

prism

Code for "Making Translators Privacy-aware on the User's Side" (TMLR 2024)
Python
1
star