
Benchmark your model on out-of-distribution datasets with carefully collected human comparison data (NeurIPS 2021 Oral)


Benchmark • Installation • User experience • Model zoo • Datasets • Credit & citation

modelvshuman: Does your model generalise better than humans?

modelvshuman is a Python toolbox to benchmark the gap between human and machine vision. Using this library, both PyTorch and TensorFlow models can be evaluated on 17 out-of-distribution datasets with high-quality human comparison data.

πŸ† Benchmark

The top-10 models are listed here; training dataset size is indicated in brackets. Additionally, standard ResNet-50 is included as the last entry of the table for comparison. Model ranks are calculated across the full range of 52 models that we tested. If your model scores better than some (or even all) of the models here, please open a pull request and we'll be happy to include it here!

Most human-like behaviour

| winner | model | accuracy difference ↓ | observed consistency ↑ | error consistency ↑ | mean rank ↓ |
|:---:|---|---:|---:|---:|---:|
| 🥇 | ViT-22B-384: ViT-22B (4B) | .018 | .783 | .258 | 1.67 |
| 🥈 | CLIP: ViT-B (400M) | .023 | .758 | .281 | 3 |
| 🥉 | ViT-22B-560: ViT-22B (4B) | .022 | .739 | .281 | 3.33 |
| 👍 | SWSL: ResNeXt-101 (940M) | .028 | .752 | .237 | 6 |
| 👍 | BiT-M: ResNet-101x1 (14M) | .034 | .733 | .252 | 7 |
| 👍 | BiT-M: ResNet-152x2 (14M) | .035 | .737 | .243 | 7.67 |
| 👍 | ViT-L (1M) | .033 | .738 | .222 | 9.33 |
| 👍 | BiT-M: ResNet-152x4 (14M) | .035 | .732 | .233 | 10.33 |
| 👍 | BiT-M: ResNet-50x3 (14M) | .040 | .726 | .228 | 12 |
| 👍 | ViT-L (14M) | .035 | .744 | .206 | 12 |
| ... | standard ResNet-50 (1M) | .087 | .665 | .208 | 31.33 |
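
Here, "mean rank" averages a model's rank on the three behavioural metrics (accuracy difference, observed consistency, error consistency), each rank computed across all 52 tested models. As a tiny illustrative sketch with made-up ranks (not benchmark results):

    # Illustrative only: mean rank as the average of a model's rank on the three metrics.
    per_metric_ranks = {
        "model_a": [1, 2, 2],   # made-up ranks: accuracy difference, observed consistency, error consistency
        "model_b": [3, 3, 3],
    }

    mean_ranks = {name: round(sum(ranks) / len(ranks), 2) for name, ranks in per_metric_ranks.items()}
    print(mean_ranks)   # {'model_a': 1.67, 'model_b': 3.0}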

Highest OOD (out-of-distribution) distortion robustness

| winner | model | OOD accuracy ↑ | rank ↓ |
|:---:|---|---:|---:|
| 🥇 | ViT-22B-224: ViT-22B (4B) | .837 | 1 |
| 🥈 | Noisy Student: EfficientNet-L2 (300M) | .829 | 2 |
| 🥉 | ViT-22B-384: ViT-22B (4B) | .798 | 3 |
| 👍 | ViT-L (14M) | .733 | 4 |
| 👍 | CLIP: ViT-B (400M) | .708 | 5 |
| 👍 | ViT-L (1M) | .706 | 6 |
| 👍 | SWSL: ResNeXt-101 (940M) | .698 | 7 |
| 👍 | BiT-M: ResNet-152x2 (14M) | .694 | 8 |
| 👍 | BiT-M: ResNet-152x4 (14M) | .688 | 9 |
| 👍 | BiT-M: ResNet-101x3 (14M) | .682 | 10 |
| ... | standard ResNet-50 (1M) | .559 | 34 |

🔧 Installation

Simply clone the repository to a location of your choice and follow these steps (requires python3.8):

  1. Set the repository home path by running the following from the command line:

    export MODELVSHUMANDIR=/absolute/path/to/this/repository/
    
  2. Within the cloned repository, install the package:

    pip install -e .
    

    (The -e option makes sure that changes to the code are reflected in the package, which is important e.g. if you add your own model or make any other changes.) A quick sanity check of the installation is sketched below.
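
After both steps, a quick way to confirm the setup (purely illustrative, not part of the toolbox) is to check that the environment variable is set and that the package imports from the cloned repository:

    import os

    import modelvshuman

    # MODELVSHUMANDIR should point at the repository root (see step 1).
    print(os.environ.get("MODELVSHUMANDIR"))

    # With an editable install, the package path should lie inside the cloned repository.
    print(modelvshuman.__file__)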

🔬 User experience

Simply edit examples/evaluate.py as desired and run it. This will test a list of models on out-of-distribution datasets and generate plots. If you then compile latex-report/report.tex, all the plots will be included in one convenient PDF report.

🐫 Model zoo

All currently implemented models are registered in the model registry; see "How to list all available models" below for the full, up-to-date list.

If you add or implement your own model, please make sure to compute its ImageNet accuracy as a sanity check.

How to load a model

If you just want to load a model from the model zoo, this is what you can do:

    # loading a PyTorch model from the zoo
    from modelvshuman.models.pytorch.model_zoo import InfoMin
    model = InfoMin("InfoMin")

    # loading a TensorFlow model from the zoo
    from modelvshuman.models.tensorflow.model_zoo import efficientnet_b0
    model = efficientnet_b0("efficientnet_b0")

Then, if you have a custom set of images that you want to evaluate the model on, load those (in the example below, called images) and evaluate via:

    import torch

    output_numpy = model.forward_batch(images)

    # by default, type(output_numpy) is numpy.ndarray, which can be converted to a tensor via:
    output_tensor = torch.tensor(output_numpy)
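
Putting the pieces together, a minimal end-to-end sketch might look as follows. The random input batch, its shape, and the interpretation of the output as class scores are illustrative assumptions; the required preprocessing (and whether forward_batch expects a tensor or an array) depends on the specific model wrapper.

    import numpy as np
    import torch

    from modelvshuman.models.pytorch.model_zoo import InfoMin

    model = InfoMin("InfoMin")

    # Dummy batch of four 224x224 RGB images (illustrative only); replace this with
    # your own, properly preprocessed images.
    images = torch.rand(4, 3, 224, 224)

    output_numpy = model.forward_batch(images)   # numpy.ndarray (assumed: per-class scores)
    top1 = np.argmax(output_numpy, axis=1)       # predicted class index per image
    print(top1)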

However, if you simply want to run a model through the generalisation datasets provided by the toolbox, we recommend checking the section on User experience.

How to list all available models

All implemented models are registered by the model registry, which can then be used to list all available models of a certain framework with the following method:

    from modelvshuman import models
    
    print(models.list_models("pytorch"))
    print(models.list_models("tensorflow"))
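
The returned names can be used for quick sanity checks, e.g. to verify that a newly added model (see the next section) shows up in the registry. The sketch below assumes list_models returns a sequence of name strings; the model name used is just an illustrative guess:

    from modelvshuman import models

    pytorch_models = models.list_models("pytorch")
    print(len(pytorch_models), "PyTorch models registered")

    # "resnet50" is an illustrative guess at an entry name; use a name from the printed list.
    print("resnet50" in pytorch_models)
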
How to add a new model

Adding a new model is possible for standard PyTorch and TensorFlow models. Depending on the framework (pytorch / tensorflow), open modelvshuman/models/<framework>/model_zoo.py. Here, you can add your own model with a few lines of code, similar to how you would usually load it. If your model has a custom model definition, create a new subdirectory modelvshuman/models/<framework>/my_fancy_model/ containing fancy_model.py, which you can then import from model_zoo.py via from .my_fancy_model import fancy_model.
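
As a rough illustration of this layout (the file, function, and model names below are hypothetical; copy the actual registration mechanism from an existing entry in model_zoo.py):

    # modelvshuman/models/pytorch/my_fancy_model/fancy_model.py  (hypothetical file)
    import torch.nn as nn
    import torchvision.models

    def fancy_model(pretrained: bool = True) -> nn.Module:
        # Any standard PyTorch module works here; this sketch simply reuses
        # torchvision's ResNet-50 as a stand-in for a custom architecture.
        return torchvision.models.resnet50(pretrained=pretrained)

    # In modelvshuman/models/pytorch/model_zoo.py, you would then add
    #     from .my_fancy_model import fancy_model
    # and register/wrap it in the same way as the existing entries.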

📁 Datasets

In total, 17 datasets with human comparison data collected under highly controlled laboratory conditions in the Wichmannlab are available.

Twelve datasets correspond to parametric or binary image distortions: colour/grayscale, contrast, high-pass, low-pass (blurring), phase noise, power equalisation, opponent colour, rotation, Eidolon I, II and III, and uniform noise (example stimuli are shown in the repository's noise-stimuli figure).

The remaining five datasets correspond to the following nonparametric image manipulations: sketch, stylized, edge, silhouette, and texture-shape cue conflict (example stimuli are shown in the repository's nonparametric-stimuli figure).

How to load a dataset

Similarly, if you're interested in just loading a dataset, you can do this via:

    from modelvshuman.datasets import sketch
    dataset = sketch(batch_size=16, num_workers=4)

Note that the datasets are not downloaded when the toolbox is installed; instead, they are downloaded automatically the first time a model is evaluated on a dataset (see examples/evaluate.py).

How to list all available datasets

    from modelvshuman import datasets

    print(list(datasets.list_datasets().keys()))
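
The registry keys can also be used to pick a dataset loader programmatically. The sketch below assumes that each key matches a loader function exposed by modelvshuman.datasets, as the sketch example above suggests; batch_size and num_workers follow that example:

    from modelvshuman import datasets

    available = list(datasets.list_datasets().keys())
    print(len(available), "datasets available:", available)

    # Load one dataset by name (assumes registry keys match the loader function names).
    name = available[0]
    dataset = getattr(datasets, name)(batch_size=16, num_workers=4)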

💳 Credit

Psychophysical data were collected by us in the vision laboratory of the Wichmannlab.

That said, we used existing image dataset sources. 12 datasets were obtained from Generalisation in humans and deep neural networks. 4 datasets were obtained from ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. Additionally, we used 1 dataset from Learning Robust Global Representations by Penalizing Local Predictive Power (sketch images from ImageNet-Sketch).

We thank all model authors and repository maintainers for providing the models described above.

Citation

@inproceedings{geirhos2021partial,
  title={Partial success in closing the gap between human and machine vision},
  author={Geirhos, Robert and Narayanappa, Kantharaju and Mitzkus, Benjamin and Thieringer, Tizian and Bethge, Matthias and Wichmann, Felix A and Brendel, Wieland},
  booktitle={{Advances in Neural Information Processing Systems 34}},
  year={2021},
}

More Repositories

| Repository | Description | Language | Stars |
|---|---|---|---:|
| foolbox | A Python toolbox to create adversarial examples that fool neural networks in PyTorch, TensorFlow, and JAX | Python | 2,733 |
| imagecorruptions | Python package to corrupt arbitrary images. | Python | 409 |
| siamese-mask-rcnn | Siamese Mask R-CNN model for one-shot instance segmentation | Jupyter Notebook | 346 |
| robust-detection-benchmark | Code, data and benchmark from the paper "Benchmarking Robustness in Object Detection: Autonomous Driving when Winter is Coming" (NeurIPS 2019 ML4AD) | Jupyter Notebook | 182 |
| stylize-datasets | A script that applies the AdaIN style transfer method to arbitrary datasets | Python | 155 |
| robustness | Robustness and adaptation of ImageNet scale models. Pre-Release, stay tuned for updates. | Python | 128 |
| openimages2coco | Convert Open Images annotations into MS Coco format to make it a drop-in replacement | Jupyter Notebook | 112 |
| slow_disentanglement | Towards Nonlinear Disentanglement in Natural Data with Temporal Sparse Coding | Jupyter Notebook | 72 |
| frequency_determines_performance | Code for the paper "No Zero-Shot Without Exponential Data: Pretraining Concept Frequency Determines Multimodal Model Performance" [NeurIPS'24] | Jupyter Notebook | 71 |
| AnalysisBySynthesis | Adversarially Robust Neural Network on MNIST. | Python | 64 |
| game-of-noise | Trained model weights, training and evaluation code from the paper "A simple way to make neural networks robust against diverse image corruptions" | Python | 62 |
| decompose | Blind source separation based on the probabilistic tensor factorisation framework | Python | 43 |
| adversarial-vision-challenge | NIPS Adversarial Vision Challenge | Python | 41 |
| CiteME | CiteME is a benchmark designed to test the abilities of language models in finding papers that are cited in scientific texts. | Python | 35 |
| InDomainGeneralizationBenchmark | | Python | 33 |
| robust-vision-benchmark | Robust Vision Benchmark | Python | 22 |
| docker | Information and scripts to run and develop the Bethge Lab Docker containers | Makefile | 20 |
| slurm-monitoring-public | Monitor your high performance infrastructure configured over slurm using TIG stack | Python | 19 |
| google_scholar_crawler | Crawl Google Scholar publications and authors | Python | 12 |
| DataTypeIdentification | Code for the ICLR'24 paper "Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models" | | 11 |
| magapi-wrapper | Wrapper around Microsoft Academic Knowledge API to retrieve MAG data | Python | 10 |
| testing_visualizations | Code for the paper "Exemplary Natural Images Explain CNN Activations Better than Feature Visualizations" | Python | 10 |
| docker-deeplearning | Development of new unified docker container (WIP) | Python | 9 |
| sort-and-search | Code for the paper "Efficient Lifelong Model Evaluation in an Era of Rapid Progress" [NeurIPS'24] | Python | 9 |
| notorious_difficulty_of_comparing_human_and_machine_perception | Code for the three case studies: Closed Contour Detection, Synthetic Visual Reasoning Test, Recognition Gap | Jupyter Notebook | 8 |
| lifelong-benchmarks | Benchmarks introduced in the paper "Lifelong Benchmarks: Efficient Model Evaluation in an Era of Rapid Progress" | | 8 |
| tools | | Shell | 6 |
| docker-jupyter-deeplearning | Docker Image with Jupyter for Deep Learning (Caffe, Theano, Lasagne, Keras) | | 6 |
| docker-xserver | Docker Image with Xserver, OpenBLAS and correct user settings | Shell | 2 |
| gym-Atari-SpaceInvaders-V0 | | Python | 1 |
| bwki-weekly-tasks | BWKI Task of the week | Jupyter Notebook | 1 |