
Multi-Domain Long-Tailed Recognition (MDLT)

This repository contains the implementation for the paper: On Multi-Domain Long-Tailed Recognition, Imbalanced Domain Generalization and Beyond (ECCV 2022).

It is also a (living) PyTorch suite containing benchmark datasets and algorithms for Multi-Domain Long-Tailed Recognition (MDLT). Currently we support 8 MDLT datasets (3 synthetic + 5 real), as well as ~20 algorithms that span different learning strategies. Feel free to send us a PR to add your algorithm / dataset for MDLT!


MDLT: From Single- to Multi-Domain Imbalanced Learning

Existing studies on data imbalance focus on single-domain settings, i.e., samples are from the same data distribution. However, natural data can originate from distinct domains, where a minority class in one domain could have abundant instances from other domains. We systematically investigate Multi-Domain Long-Tailed Recognition (MDLT), which learns from multi-domain imbalanced data, addresses label imbalance, domain shift, and divergent label distributions across domains, and generalizes to all domain-class pairs.

We develop the domain-class transferability graph, and show that such transferability governs the success of learning in MDLT. We then propose BoDA, a theoretically grounded learning strategy that tracks the upper bound of transferability statistics, and ensures balanced alignment and calibration across imbalanced domain-class distributions. We curate MDLT benchmark datasets based on widely-used multi-domain datasets, and benchmark ~20 algorithms that span different learning strategies for MDLT.
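
The core quantity behind the transferability graph is simple: per (domain, class) feature statistics, compared across domain-class pairs. Below is a minimal PyTorch sketch of those statistics together with a toy balanced alignment penalty; it is an illustrative simplification, not the paper's exact BoDA loss or this repository's implementation:

    import torch

    def domain_class_means(features, domains, labels, n_domains, n_classes):
        """Mean feature vector and sample count for every (domain, class) pair."""
        dim = features.size(1)
        means = torch.zeros(n_domains, n_classes, dim)
        counts = torch.zeros(n_domains, n_classes)
        for f, d, c in zip(features, domains, labels):
            means[d, c] += f
            counts[d, c] += 1
        valid = counts > 0
        means[valid] /= counts[valid].unsqueeze(-1)
        return means, counts

    def balanced_alignment(means, counts):
        """Toy penalty: pull same-class means together across domains,
        down-weighting head (large-count) pairs so that tail domain-class
        pairs contribute comparably. A simplified stand-in for BoDA's
        balanced distance statistics."""
        n_domains, n_classes, _ = means.shape
        loss, n_terms = 0.0, 0
        for c in range(n_classes):
            for d1 in range(n_domains):
                for d2 in range(d1 + 1, n_domains):
                    if counts[d1, c] > 0 and counts[d2, c] > 0:
                        w = 1.0 / (counts[d1, c] * counts[d2, c]).sqrt()
                        loss = loss + w * (means[d1, c] - means[d2, c]).pow(2).sum()
                        n_terms += 1
        return loss / max(n_terms, 1)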

Beyond MDLT: Domain Generalization under Data Imbalance

Further, as a byproduct, we demonstrate that BoDA strengthens Domain Generalization (DG) algorithms and consistently improves results on DG benchmarks. Notably, all current standard DG benchmarks naturally exhibit heavy class imbalance within domains and label distribution shifts across domains, confirming that data imbalance is an intrinsic problem in DG, one that has so far been overlooked by past work.

The results shed light on how label imbalance can affect out-of-distribution generalization, and highlight the importance of integrating label imbalance into practical DG algorithm design.

Getting Started

Installation

Prerequisites

  1. Download the original datasets and place them in your data_path:

python -m mdlt.scripts.download --data_dir <data_path>

  2. Place the .csv files of the train/val/test splits for each MDLT dataset (provided in mdlt/dataset/split/) in the corresponding dataset folder under your data_path.

Dependencies

  1. PyTorch (>=1.4, tested on 1.4 / 1.9)
  2. pandas
  3. TensorboardX
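
Assuming a standard pip environment, these can be installed with, for example:

pip install "torch>=1.4" pandas tensorboardX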

Code Overview

Main Files

  • train.py: main training script
  • sweep.py: launch a sweep with all selected algorithms (provided in mdlt/learning/algorithms.py) on all real MDLT datasets (VLCS-MLT, PACS-MLT, OfficeHome-MLT, TerraInc-MLT, DomainNet-MLT)
  • sweep_synthetic.py: launch a sweep with all selected algorithms on the synthetic MDLT dataset (Digits-MLT)
  • collect_results.py: collect sweep results to automatically generate result tables (as in the paper)
  • eval_best_hparam.py & eval_checkpoint.py: scripts for evaluating trained models

Main Arguments

  • train.py:
    • --dataset: name of chosen MDLT dataset
    • --algorithm: the algorithm to use for training
    • --data_dir: data path
    • --output_dir: output path
    • --output_folder_name: output folder name (under output_dir) for the current run
    • --hparams_seed: seed for different hyper-parameters
    • --seed: seed for different runs
    • --selected_envs: train on selected envs (only used for Digits-MLT)
    • --imb_type & --imb_factor: arguments for customized Digits-MLT label distributions
    • --stage1_folder & --stage1_algo: arguments for two-stage algorithms
  • sweep.py:
    • --n_hparams: how many hparams to run for each <dataset, algorithm> pair
    • --best_hp & --n_trials: after sweeping hparams, fix best hparam and run trials with different seeds

Usage

Train a single model

python -m mdlt.train --algorithm <algo> --dataset <dset> --output_folder_name <output_folder_name> --data_dir <data_path> --output_dir <output_path>
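
For instance, a concrete run might look like the following (the ERM algorithm and VLCS dataset identifiers here are illustrative; check mdlt/learning/algorithms.py and the dataset definitions for the exact names):

python -m mdlt.train --algorithm ERM --dataset VLCS --output_folder_name erm_vlcs --data_dir ./data --output_dir ./output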

Train a model with the two-stage scheme (second-stage classifier learning)

python -m mdlt.train --algorithm CRT --dataset <dset> --output_folder_name <output_folder_name> --data_dir <data_path> --output_dir <output_path> --stage1_folder <stage1_model_folder> --stage1_algo <stage1_algo>

Note that for $\text{BoDA}_{r,c}$ the command is the same as above, changing only --stage1_algo and --stage1_folder.
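
Conceptually, the second stage freezes the stage-1 featurizer and re-trains only the classifier on top of it. A minimal sketch of that idea, with assumed module names rather than this repository's actual classes:

    import copy
    import torch.nn as nn

    def build_stage2_model(stage1_model, feat_dim, n_classes):
        """Classifier re-training: freeze the stage-1 featurizer and
        learn a fresh linear classifier on top of its features."""
        featurizer = copy.deepcopy(stage1_model)
        for p in featurizer.parameters():
            p.requires_grad = False                  # keep stage-1 features fixed
        classifier = nn.Linear(feat_dim, n_classes)  # the only trained part
        return nn.Sequential(featurizer, classifier), classifier.parameters()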

Train a model on Digits-MLT, with Forward-LT imbalance for all domains and an imbalance ratio of 0.01

python -m mdlt.train --algorithm <algo> --dataset ImbalancedDigits \
       --imb_type eee \
       --imb_factor 0.01 \
       --selected_envs 1 2

Note that for Digits-MLT, we additionally provide MNIST as another domain. To match the two-domain setting used in the paper, simply set --selected_envs to 1 2 as above.
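
For intuition on the imbalance ratio: a Forward-LT (exponential) profile decays per-class sample counts geometrically, so the smallest class keeps imb_factor times as many samples as the largest. A hedged illustration of that construction (mirroring the common CIFAR-LT recipe, not necessarily this repository's exact code):

    def long_tailed_counts(n_max, n_classes, imb_factor=0.01):
        """Per-class sample counts under an exponential (Forward-LT) profile:
        class 0 keeps n_max samples, the last class keeps n_max * imb_factor."""
        return [int(n_max * imb_factor ** (c / (n_classes - 1)))
                for c in range(n_classes)]

    # e.g., 10 digit classes with 1,000 samples in the largest class:
    # long_tailed_counts(1000, 10) -> [1000, 599, 359, 215, 129, 77, 46, 27, 16, 10]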

Launch a sweep with different hparams

python -m mdlt.sweep launch --algorithms <...> --dataset <...> --n_hparams <num_of_hparams> --n_trials 1

Launch a sweep over different seeds after fixing the best hparams

python -m mdlt.sweep launch --algorithms <...> --dataset <...> --best_hp --input_folder <...> --n_trials <num_of_trials>

Collect the results of your sweep

python -m mdlt.scripts.collect_results --input_dir <...>

Evaluate the best hparam model for a <dataset, algo> pair

python -u -m mdlt.evaluate.eval_best_hparam --algorithm <...> --dataset <...> --data_dir <...> --output_dir <...> --folder_name <...>

Evaluate a trained checkpoint

python -u -m mdlt.evaluate.eval_checkpoint --algorithm <...> --dataset <...> --data_dir <...> --checkpoint <...>

Reproduced Benchmarks and Model Zoo

Model        VLCS-MLT       PACS-MLT       OfficeHome-MLT   TerraInc-MLT   DomainNet-MLT
BoDA (r)     76.9 / model   97.0 / model   81.5 / model     78.6 / model   60.1 / model
BoDA (r,c)   77.3 / model   97.2 / model   82.3 / model     82.3 / model   61.7 / model

Updates

  • [10/2022] Check out the Oral talk video (10 mins) for our ECCV paper.
  • [07/2022] We created a blog post for this work (a Chinese version is also available here). Check it out for more details!
  • [07/2022] Paper accepted to ECCV 2022. We have released the code and models.
  • [03/2022] arXiv version posted. The code is currently being cleaned up; please stay tuned for updates.

Acknowledgements

This code is partly based on the open-source implementations from DomainBed.

Citation

If you find this code or idea useful, please cite our work:

@inproceedings{yang2022multi,
  title={On Multi-Domain Long-Tailed Recognition, Imbalanced Domain Generalization and Beyond},
  author={Yang, Yuzhe and Wang, Hao and Katabi, Dina},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2022}
}

Contact

If you have any questions, feel free to contact us via email ([email protected]) or GitHub issues. Enjoy!
