Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm.
DeCLIP is an open-source project that welcomes any contribution and feedback. We hope that the toolbox and benchmark can serve the growing research community by providing a flexible, standardized toolkit for reimplementing existing methods and developing new Contrastive Language-Image Pretraining methods. This repo contains:
- Pre-trained models and training code to reproduce various Contrastive Language-Image Pretraining methods (e.g. CLIP, DeCLIP, SLIP, FILIP).
- Various benchmark datasets for large-scale Contrastive Language-Image Pretraining tasks.
- Zero-shot transfer and linear classification evaluation scripts for downstream datasets.
We aim to democratize large-scale CLIP and build a fair and reproducible CLIP community. Our papers are:
DeCLIP: Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm.
CLIP-Benchmark: Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision.
Call for Papers & Participation
[Workshop] | [IC Challenge] | [OD Challenge] |
Introduction
Recently, large-scale Contrastive Language-Image Pre-training (CLIP) (Radford et al., 2021) has attracted unprecedented attention for its impressive zero-shot recognition ability and excellent transferability to downstream tasks. However, CLIP is quite data-hungry and requires 400M image-text pairs for pre-training, which restricts its adoption. This work proposes a novel training paradigm, Data efficient CLIP (DeCLIP), to alleviate this limitation. We demonstrate that by carefully utilizing the widespread supervision among the image-text pairs, DeCLIP can learn generic visual features more efficiently. Instead of using only the single image-text contrastive supervision, we fully exploit the data's potential through (1) self-supervision within each modality; (2) multi-view supervision across modalities; and (3) nearest-neighbor supervision from other similar pairs. Benefiting from these intrinsic supervision signals, our DeCLIP-ResNet50 achieves 60.4% zero-shot top-1 accuracy on ImageNet, which is 0.8% above CLIP-ResNet50 while using 7.1× less data. DeCLIP-ResNet50 also outperforms its counterpart on 8 out of 11 visual datasets when transferred to downstream tasks. Moreover, scaling up the model and compute also works well in our framework.
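To make the supervision terms above more concrete, here is a minimal, framework-free sketch of the symmetric image-text contrastive (InfoNCE) loss that CLIP-style training builds on, plus one plausible reading of multi-view supervision as an average over every pairing of image views and text views. This is an illustration only, not the code in this repo: the function names, the plain-list embeddings, and the temperature of 0.07 are all assumptions, and the self-supervision and nearest-neighbor terms are not covered.

```python
import math
from itertools import product


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy(logits, target_index):
    """Negative log-softmax of the target entry (numerically stable)."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[target_index]


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss; row k of each input is assumed to be a matched pair.

    The temperature value is an assumption chosen for illustration.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]
    # Image-to-text: each image should pick out its own caption.
    loss_i2t = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text-to-image: each caption should pick out its own image.
    loss_t2i = sum(
        cross_entropy([logits[row][k] for row in range(n)], k) for k in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)


def multi_view_loss(image_views, text_views, temperature=0.07):
    """Average the contrastive loss over every (image view, text view) pairing.

    Hypothetical helper: each element of image_views / text_views is a full
    batch of embeddings produced from one augmentation of the same pairs.
    """
    pairings = list(product(image_views, text_views))
    return sum(contrastive_loss(i, t, temperature) for i, t in pairings) / len(pairings)
```

With two image views and two text views, `multi_view_loss` averages four contrastive terms instead of one, which is the sense in which extra views supply additional supervision from the same image-text pairs.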
Updates
2022-09-19
2022-06-25 We release the checkpoints of each model for the benchmark.
2022-03-10 We update the result of CLIP-Benchmark and release our YFCC15M dataset.
2022-02-22 We release our training code, benchmark, and model zoo! We will release the checkpoints of each model soon, after aligning the results. We hope this project can serve the growing Contrastive Language-Image Pretraining research community by providing a flexible and standardized toolkit.
2021-11-06 First commit. Our code, dataset and models will be released soon.
Installation
Please refer to get_started.md for installation and dataset_prepare.md for dataset preparation.
Get Started
Install PyTorch. The code has been tested with CUDA 11.2/CuDNN 8.1.0, PyTorch 1.8.1.
First, prepare pre-training datasets and downstream classification datasets through get_started.md.
We organize the models trained on different data into separate experiment directories (experiments/); check that directory for details.
1. Pre-training
You can run run.sh directly to train the corresponding model. We train most of our models on 4x8-gpu nodes. Check the config in the experiment directory of the corresponding model for details.
2. Zero-shot Evaluation
You can add the argument --evaluate to the run script for zero-shot evaluation.
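Conceptually, zero-shot evaluation compares each image embedding against one text embedding per class name and predicts the most similar class. The sketch below illustrates that idea with plain Python lists. It is not the evaluation code behind --evaluate; the function names and the toy embeddings are hypothetical, and a real run would obtain the embeddings from the trained image and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_emb, class_text_embs):
    """Return the class whose text embedding is most similar to the image.

    class_text_embs maps a class name to the embedding of its text prompt.
    """
    scores = {
        name: cosine_similarity(image_emb, text_emb)
        for name, text_emb in class_text_embs.items()
    }
    return max(scores, key=scores.get)


def top1_accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose predicted class matches the label."""
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return correct / len(labels)


# Toy example with made-up 2-D embeddings (illustration only).
toy_classes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
toy_images = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
toy_labels = ["cat", "dog", "dog"]
print(top1_accuracy(toy_images, toy_labels, toy_classes))
```

No classifier is trained on the downstream dataset here: the class names alone define the decision rule, which is why the metric is called zero-shot.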
DeCLIP Model-Zoo
Our pre-trained visual backbone models (without text encoder)
Method | Dataset | Model | Epochs | 0-shot (%) | Config | Paper | Weights |
---|---|---|---|---|---|---|---|
DeCLIP | Declip-88M | ResNet50 | 32 | 62.5 | config | paper | Google Drive |
DeCLIP | Declip-88M | ViT-B32 | 32 | 66.2 | config | paper | Google Drive |
Our pre-trained DeCLIP models (with text encoder)
Method | Dataset | Model | Epochs | 0-shot (%) | Config | Paper | Weights |
---|---|---|---|---|---|---|---|
DeCLIP | Declip-88M | ResNet50 | 32 | 62.5 | config | paper | Google Drive |
DeCLIP | Declip-88M | ViT-B32 | 32 | 66.2 | config | paper | Google Drive |
CLIP-Benchmark
Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision. Our paper is available on arXiv.
Witnessing its great success, researchers continue to push the frontier of CLIP. For instance, SLIP, DeCLIP and FILIP achieve considerable improvements by embracing different kinds of supervision within the image-text pairs. However, it remains challenging to make a fair comparison between these methods, because they do not use consistent training recipes and even use different data. We propose CLIP-Benchmark, a first attempt to evaluate, analyze, and benchmark CLIP and its variants. Moreover, we combine DeCLIP with FILIP, which gives us the strongest variant, DeFILIP.
Supported Models:
The following models are pre-trained on YFCC15M and evaluated on ImageNet-1K (ILSVRC2012).
Method | Dataset | Model | Epochs | 0-shot (%) | Config | Paper | Weights |
---|---|---|---|---|---|---|---|
CLIP | YFCC-15M | ViT-B32 | 32 | 32.8 | config | paper | Google Drive |
DeCLIP | YFCC-15M | ViT-B32 | 32 | 43.2 | config | paper | Google Drive |
SLIP | YFCC-15M | ViT-B32 | 32 | 34.3 | config | paper | Google Drive |
FILIP | YFCC-15M | ViT-B32 | 32 | 39.5 | config | paper | Google Drive |
DeFILIP | YFCC-15M | ViT-B32 | 32 | 45.0 | config | paper | Google Drive |
Method | Dataset | Model | Epochs | 0-shot (%) | Config | Paper | Weights |
---|---|---|---|---|---|---|---|
CLIP | YFCC-15M | ResNet50 | 32 | 37.2 | config | paper | Google Drive |
DeCLIP | YFCC-15M | ResNet50 | 32 | 44.4 | config | paper | Google Drive |
SLIP | YFCC-15M | ResNet50 | 32 | 28.5 | config | paper | -- |
FILIP | YFCC-15M | ResNet50 | 32 | 21.3 | config | paper | -- |
Supported datasets:
Dataset | Samples | Download | Paper |
---|---|---|---|
YFCC-15M | 15,388,848 | Google Drive | url |
Changelog
2022-02-22 Release our training code
2021-11-06 First commit
Citation
@inproceedings{li2022supervision,
title={Supervision Exists Everywhere: A Data Efficient Contrastive Language-Image Pre-training Paradigm},
author={Yangguang Li and Feng Liang and Lichen Zhao and Yufeng Cui and Wanli Ouyang and Jing Shao and Fengwei Yu and Junjie Yan},
booktitle={International Conference on Learning Representations},
year={2022},
url={https://openreview.net/forum?id=zq1iJkNk3uN}
}
@misc{cui2022democratizing,
title={Democratizing Contrastive Language-Image Pre-training: A CLIP Benchmark of Data, Model, and Supervision},
author={Yufeng Cui and Lichen Zhao and Feng Liang and Yangguang Li and Jing Shao},
year={2022},
eprint={2203.05796},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
License
For academic use, this project is licensed under the 2-clause BSD License. For commercial use, please contact the authors.
Acknowledgement
Our framework is based on prototype.