Prompt, Generate, then Cache
Official implementation of 'Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners'.
The paper has been accepted by CVPR 2023.
News
- Please check our latest work 'Point-NN, Parameter is Not All You Need' with code, accepted by CVPR 2023 🔥, which conducts 3D understanding without any parameters or training.
- CaFo cascaded with ChatGPT and Stable Diffusion on Caltech-101 dataset has been released 📌.
- The code of CaFo has been released.
- The CaFo model is developed based on Tip-Adapter, accepted by ECCV 2022 and open-sourced.
Introduction
We propose CaFo, a Cascade of Foundation models that incorporates diverse prior knowledge from various pre-training paradigms for better few-shot learning, including CLIP, DINO, DALL-E, and GPT-3. Specifically, CaFo works by 'Prompt, Generate, then Cache'. We leverage GPT-3 to prompt CLIP with rich linguistic semantics and generate synthetic images via DALL-E to expand the few-shot training data. Then, we introduce a learnable cache model to adaptively blend the predictions from CLIP and DINO. Through this collaboration, CaFo can fully unleash the potential of different pre-training methods and unify them to achieve state-of-the-art performance for few-shot classification.
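The cache step can be illustrated with a minimal sketch. It assumes a Tip-Adapter-style key-value cache (few-shot features as keys, one-hot labels as values) and, for brevity, replaces CaFo's learnable, adaptive blending with fixed weights. The helper names (`cache_logits`, `blend`, `w_clip`, `w_dino`) and the toy 2-D features are hypothetical and are not taken from the repository.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cache_logits(query, keys, values, beta):
    """Tip-Adapter-style cache: affinity-weighted sum of one-hot few-shot labels."""
    q = normalize(query)
    num_classes = len(values[0])
    out = [0.0] * num_classes
    for key, onehot in zip(keys, values):
        # Affinity is 1 for identical directions and decays as cosine similarity drops.
        affinity = math.exp(-beta * (1.0 - dot(q, normalize(key))))
        for c in range(num_classes):
            out[c] += affinity * onehot[c]
    return out

def blend(zero_shot, clip_cache, dino_cache, alpha, w_clip=0.5, w_dino=0.5):
    """Assumption: fixed blend weights stand in for CaFo's adaptive weighting."""
    return [z + alpha * (w_clip * c + w_dino * d)
            for z, c, d in zip(zero_shot, clip_cache, dino_cache)]

# Toy example: two cached samples, one per class.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1, 0], [0, 1]]
scores = cache_logits([2.0, 0.0], keys, values, beta=1.0)
```

Here `beta` controls how sharply the cache favors near neighbors and `alpha` scales the cache contribution relative to the zero-shot prediction, which is the role these names play in Tip-Adapter.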
Requirements
Installation
Create a conda environment and install dependencies:
git clone https://github.com/ZrrSkywalker/CaFo.git
cd CaFo
conda create -n cafo python=3.7
conda activate cafo
pip install -r requirements.txt
# Install the corresponding versions of torch and torchvision
conda install pytorch torchvision cudatoolkit
Dataset
Please follow DATASET.md to download the official ImageNet and the other 10 datasets.
Foundation Models
- The pre-trained weights of CLIP will be downloaded automatically when the code is run.
- The prompts produced by GPT-3 have been stored at gpt_file/.
- Please download DINO's pre-trained ResNet-50 from here, and put it under dino/.
- Please download DALL-E's generated images from here, and organize them with the official datasets like
$DATA/
|–– imagenet/
|–– caltech-101/
|–– oxford_pets/
|–– ...
|–– dalle_imagenet/
|–– dalle_caltech-101/
|–– dalle_oxford_pets/
|–– ...
|–– sd_caltech-101/
- For the Caltech-101 dataset, we also provide Stable Diffusion's images from here, and ChatGPT's prompts in gpt_file/.
Get Started
Configs
The running configurations for different [dataset] with [k] shots can be modified in configs/[dataset]/[k]shot.yaml, including visual encoders and hyperparameters. We have provided the configurations for reproducing the results in the paper. You can edit search_scale, search_step, init_beta and init_alpha for fine-grained tuning and better results.
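As a rough illustration of what tuning these values involves, the sketch below runs a grid search over alpha and beta. It assumes that search_scale holds the upper bound of each range and search_step the number of grid points, in [beta, alpha] order; this reading of the config keys is an assumption, and the repository's actual search may differ. The names `make_grid`, `grid_search`, and `toy_accuracy` are hypothetical.

```python
def make_grid(upper, steps, lower=0.1):
    """Evenly spaced candidates starting at `lower`; `upper` itself is excluded."""
    return [lower + i * (upper - lower) / steps for i in range(steps)]

def grid_search(evaluate, search_scale, search_step):
    """Return (best_accuracy, best_alpha, best_beta) over the full grid."""
    best = (float("-inf"), None, None)
    for beta in make_grid(search_scale[0], search_step[0]):
        for alpha in make_grid(search_scale[1], search_step[1]):
            acc = evaluate(alpha, beta)
            if acc > best[0]:
                best = (acc, alpha, beta)
    return best

# Toy stand-in for validation accuracy, peaking near alpha=1, beta=2.
def toy_accuracy(alpha, beta):
    return -((alpha - 1.0) ** 2) - ((beta - 2.0) ** 2)

best_acc, best_alpha, best_beta = grid_search(toy_accuracy, [4.1, 4.1], [4, 4])
```

In practice `evaluate` would score the cached validation features for a given alpha and beta. A wider scale or more steps gives a finer search at the cost of more evaluations.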
Note that load_cache and load_pre_feat default to False for the first run, which stores the cache model and val/test features in configs/dataset/. For later runs, they can be set to True for faster hyperparameter tuning.
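The build-once, load-later pattern behind these two flags can be sketched as follows. This is only an illustration of the idea: it uses `pickle` and a temporary file as stand-ins, and the function `load_or_build`, the file name `cache.pkl`, and the toy feature dictionary are hypothetical rather than the repository's actual code or storage format.

```python
import os
import pickle
import tempfile

def load_or_build(path, build_fn, load_flag):
    """Load a stored result when load_flag is True; otherwise build and store it."""
    if load_flag:
        with open(path, "rb") as f:
            return pickle.load(f)
    result = build_fn()
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []

def build_features():
    # Stand-in for the expensive feature extraction over the few-shot set.
    calls.append(1)
    return {"features": [0.1, 0.2, 0.3]}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache.pkl")
    first = load_or_build(path, build_features, load_flag=False)   # first run: build and store
    second = load_or_build(path, build_features, load_flag=True)   # later run: load only
```

Setting the flag to True before a stored file exists would fail with a missing-file error in this sketch, which is why the first run keeps it False.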
For the Caltech101 dataset, the configs for Stable Diffusion's images and ChatGPT's prompts are in configs/sd_caltech101 and configs/chat_caltech101, respectively.
Running
For 16-shot ImageNet dataset:
CUDA_VISIBLE_DEVICES=0 python main_imagenet.py --config configs/imagenet/16shot.yaml
For the other 10 datasets:
CUDA_VISIBLE_DEVICES=0 python main.py --config configs/dataset/16shot.yaml
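To run several shot settings in sequence, the commands above can be generated programmatically. The sketch below only builds the command strings and does not execute them. The helper `build_command` is hypothetical, and the shot counts (1, 2, 4, 8, 16) are an assumption about which configs/[dataset]/[k]shot.yaml files exist.

```python
def build_command(dataset, shots, gpu=0):
    """Compose the launch command for one dataset and shot count."""
    # ImageNet has its own entry script; the other datasets share main.py.
    script = "main_imagenet.py" if dataset == "imagenet" else "main.py"
    config = f"configs/{dataset}/{shots}shot.yaml"
    return f"CUDA_VISIBLE_DEVICES={gpu} python {script} --config {config}"

# Assumed shot counts; adjust to the config files actually present.
commands = [build_command("imagenet", k) for k in (1, 2, 4, 8, 16)]
for cmd in commands:
    print(cmd)
```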
Numerical Results
We provide CaFo's numerical results on 11 datasets from 1 to 16 shots at exp_Cafo.log. The results for Tip-Adapter and Tip-Adapter-F are at exp_Tip.log.
Acknowledgement
This repo benefits from Tip-Adapter, CLIP, DINO, DALL-E and CuPL. Thanks for their wonderful work.
Citation
@article{zhang2023prompt,
title={Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners},
author={Renrui Zhang and Xiangfei Hu and Bohao Li and Siyuan Huang and Hanqiu Deng and Hongsheng Li and Yu Qiao and Peng Gao},
journal={arXiv preprint arXiv:2303.02151},
year={2023}
}
Contributors
Renrui Zhang, Xiangfei Hu, Bohao Li
Contact
If you have any questions about this project, please feel free to contact [email protected] and [email protected].