  • Stars: 335
  • Rank: 125,113 (Top 3%)
  • Language: Python
  • License: MIT License
  • Created: over 1 year ago
  • Updated: over 1 year ago

Repository Details

[CVPR 2023] Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners

Prompt, Generate, then Cache

Official implementation of 'Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners'.

The paper has been accepted by CVPR 2023 🔥.

News

Introduction

We propose CaFo, a Cascade of Foundation models that incorporates diverse prior knowledge from various pre-training paradigms, including CLIP, DINO, DALL-E, and GPT-3, for better few-shot learning. Specifically, CaFo works by 'Prompt, Generate, then Cache'. We leverage GPT-3 to prompt CLIP with rich linguistic semantics and generate synthetic images via DALL-E to expand the few-shot training data. Then, we introduce a learnable cache model to adaptively blend the predictions from CLIP and DINO. Through this collaboration, CaFo fully unleashes the potential of the different pre-training methods and unifies them to achieve state-of-the-art performance on few-shot classification.
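
The cache-model idea can be illustrated with a minimal, Tip-Adapter-style sketch in Python. The tensor names, shapes, and the blending rule in cafo_blend below are illustrative assumptions for exposition, not the exact formulas used in this repo:

import torch
import torch.nn.functional as F

def cache_logits(test_feat, cache_keys, cache_values, beta):
    # test_feat: [B, D] L2-normalized test features from one encoder (CLIP or DINO)
    # cache_keys: [N, D] L2-normalized few-shot training features (the cache "keys")
    # cache_values: [N, C] one-hot labels of the few-shot images (the cache "values")
    affinity = test_feat @ cache_keys.t()            # cosine similarities, [B, N]
    weights = torch.exp(-beta * (1.0 - affinity))    # sharpen affinities with beta
    return weights @ cache_values                    # [B, C] cache prediction

def cafo_blend(clip_zeroshot_logits, clip_cache, dino_cache, alpha):
    # Weight each cache branch by its agreement with CLIP's zero-shot distribution
    # (one plausible adaptive-blending rule; illustrative, not the repo's exact formula).
    p_zs = clip_zeroshot_logits.softmax(dim=-1)
    w_clip = (p_zs * clip_cache.softmax(dim=-1)).sum(dim=-1, keepdim=True)
    w_dino = (p_zs * dino_cache.softmax(dim=-1)).sum(dim=-1, keepdim=True)
    blended = (w_clip * clip_cache + w_dino * dino_cache) / (w_clip + w_dino)
    return clip_zeroshot_logits + alpha * blended    # residual blend, scaled by alpha

# Toy usage with random tensors (B=4 test images, N=16 cached shots, C=10 classes, D=512 dims);
# in practice the two cache branches would use CLIP and DINO features respectively.
B, N, C, D = 4, 16, 10, 512
feat = F.normalize(torch.randn(B, D), dim=-1)
keys = F.normalize(torch.randn(N, D), dim=-1)
vals = F.one_hot(torch.randint(0, C, (N,)), C).float()
zs_logits = torch.randn(B, C)
logits = cafo_blend(zs_logits,
                    cache_logits(feat, keys, vals, beta=5.0),
                    cache_logits(feat, keys, vals, beta=5.0),
                    alpha=1.0)
print(logits.shape)  # torch.Size([4, 10])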

Requirements

Installation

Create a conda environment and install dependencies:

git clone https://github.com/ZrrSkywalker/CaFo.git
cd CaFo

conda create -n cafo python=3.7
conda activate cafo

pip install -r requirements.txt

# Install the appropriate versions of torch and torchvision
conda install pytorch torchvision cudatoolkit
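
A quick sanity check (a minimal one-liner, assuming the packages above installed cleanly) confirms that torch and torchvision import and that the GPU is visible:

python -c "import torch, torchvision; print(torch.__version__, torchvision.__version__, torch.cuda.is_available())"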

Dataset

Please follow DATASET.md to download the official ImageNet and the other 10 datasets.

Foundation Models

  • The pre-trained weights of CLIP will be downloaded automatically when you run the code.
  • The prompts produced by GPT-3 are stored in gpt_file/.
  • Please download DINO's pre-trained ResNet-50 from here and put it under dino/ (a minimal loading sketch follows this list).
  • Please download DALL-E's generated images from here, and organize them alongside the official datasets as follows:
$DATA/
|–– imagenet/
|–– caltech-101/
|–– oxford_pets/
|–– ...
|–– dalle_imagenet/
|–– dalle_caltech-101/
|–– dalle_oxford_pets/
|–– ...
|–– sd_caltech-101/
  • For the Caltech-101 dataset, we also provide Stable Diffusion's images from here and ChatGPT's prompts in gpt_file/.
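
For reference, here is a minimal sketch of loading the downloaded DINO ResNet-50 into a torchvision backbone. The file name under dino/ and the assumption that the checkpoint is a plain torchvision-compatible state_dict are illustrative; adjust to whatever the downloaded file actually contains:

import torch
import torch.nn as nn
import torchvision

# Assumed checkpoint name under dino/ (rename to match the file you downloaded).
state_dict = torch.load("dino/dino_resnet50_pretrain.pth", map_location="cpu")

backbone = torchvision.models.resnet50()
backbone.fc = nn.Identity()                              # keep the 2048-d pooled feature
missing, unexpected = backbone.load_state_dict(state_dict, strict=False)
print("missing keys:", missing, "| unexpected keys:", unexpected)

backbone.eval()
with torch.no_grad():
    feat = backbone(torch.randn(1, 3, 224, 224))         # [1, 2048] DINO visual feature
print(feat.shape)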

Get Started

Configs

The running configurations for each [dataset] with [k] shots can be modified in configs/[dataset]/[k]shot.yaml, including the visual encoders and hyperparameters. We have provided the configurations for reproducing the results in the paper. You can edit search_scale, search_step, init_beta, and init_alpha for fine-grained tuning and better results.

Note that load_cache and load_pre_feat default to False for the first run, which stores the cache model and the val/test features in configs/dataset/. For subsequent runs, they can be set to True for faster hyperparameter tuning.

For the Caltech-101 dataset, the configs for Stable Diffusion's images and ChatGPT's prompts are in configs/sd_caltech101 and configs/chat_caltech101, respectively.
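
To make these fields concrete, below is a hedged sketch of reading one of the provided configs and building the alpha/beta search grid. The interpretation of search_scale and search_step as (alpha, beta) pairs, and the val_accuracy stub, are assumptions for illustration rather than the repo's exact logic:

import yaml

with open("configs/imagenet/16shot.yaml") as f:
    cfg = yaml.safe_load(f)

print("init_alpha:", cfg.get("init_alpha"), "init_beta:", cfg.get("init_beta"))

# Assumed interpretation: search_scale holds the (alpha, beta) upper bounds and
# search_step the number of grid points per axis, Tip-Adapter style.
scale_a, scale_b = cfg["search_scale"]
steps_a, steps_b = cfg["search_step"]
alphas = [(i + 1) * scale_a / steps_a for i in range(steps_a)]
betas = [(j + 1) * scale_b / steps_b for j in range(steps_b)]

def val_accuracy(alpha, beta):
    # Stub standing in for scoring the cache model on the stored val features.
    return 0.0

best = max(((a, b) for a in alphas for b in betas), key=lambda ab: val_accuracy(*ab))
print("best (alpha, beta):", best)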

Running

For the 16-shot ImageNet dataset:

CUDA_VISIBLE_DEVICES=0 python main_imagenet.py --config configs/imagenet/16shot.yaml

For the other 10 datasets:

CUDA_VISIBLE_DEVICES=0 python main.py --config configs/dataset/16shot.yaml

Numerical Results

We provide CaFo's numerical results on 11 datasets from 1 to 16 shots in exp_Cafo.log. The results for Tip-Adapter and Tip-Adapter-F are in exp_Tip.log.

Acknowledgement

This repo benefits from Tip-Adapter, CLIP, DINO, DALL-E, and CuPL. Thanks for their wonderful work.

Citation

@article{zhang2023prompt,
  title={Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners},
  author={Renrui Zhang and Xiangfei Hu and Bohao Li and Siyuan Huang and Hanqiu Deng and Hongsheng Li and Yu Qiao and Peng Gao},
  journal={arXiv preprint arXiv:2303.02151},
  year={2023}
}

Contributors

Renrui Zhang, Xiangfei Hu, Bohao Li

Contact

If you have any questions about this project, please feel free to contact [email protected] and [email protected].

More Repositories

1. LLaMA-Adapter: [ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters (Python, 5,618 stars)
2. InternVL: [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o, an open-source multimodal dialogue model approaching GPT-4o performance (Python, 5,059 stars)
3. DragGAN: Unofficial Implementation of DragGAN - "Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold" (full-featured implementation with online demo and local deployment; code and models fully open-sourced; supports Windows, macOS, Linux) (Python, 4,996 stars)
4. InternGPT: InternGPT (iGPT) is an open-source demo platform where you can easily showcase your AI models. Now it supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, etc. Try it at igpt.opengvlab.com (Python, 3,178 stars)
5. Ask-Anything: [CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS. (Python, 2,943 stars)
6. InternImage: [CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions (Python, 2,474 stars)
7. InternVideo: [ECCV2024] Video Foundation Models & Data for Multimodal Understanding (Python, 1,258 stars)
8. VisionLLM: VisionLLM Series (Python, 801 stars)
9. VideoMamba: [ECCV2024] VideoMamba: State Space Model for Efficient Video Understanding (Python, 755 stars)
10. OmniQuant: [ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs. (Python, 643 stars)
11. VideoMAEv2: [CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking (Python, 470 stars)
12. DCNv4: [CVPR 2024] Deformable Convolution v4 (Python, 463 stars)
13. GITM: Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory (445 stars)
14. all-seeing: [ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of the Open World (Python, 436 stars)
15. Multi-Modality-Arena: Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more! (Python, 428 stars)
16. PonderV2: PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm (Python, 311 stars)
17. Vision-RWKV: Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures (Python, 297 stars)
18. LAMM: [NeurIPS 2023 Datasets and Benchmarks Track] LAMM: Multi-Modal Large Language Models and Applications as AI Agents (Python, 286 stars)
19. UniFormerV2: [ICCV2023] UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer (Python, 280 stars)
20. unmasked_teacher: [ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models (Python, 276 stars)
21. Instruct2Act: Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model (Python, 223 stars)
22. HumanBench: The official implementation of HumanBench (CVPR 2023) (Python, 218 stars)
23. OmniCorpus: OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text (HTML, 210 stars)
24. gv-benchmark: General Vision Benchmark, GV-B, a project from OpenGVLab (Python, 188 stars)
25. ControlLLM: ControlLLM: Augment Language Models with Tools by Searching on Graphs (Python, 181 stars)
26. InternVideo2 (152 stars)
27. UniHCP: Official PyTorch implementation of UniHCP (Python, 145 stars)
28. efficient-video-recognition (Python, 114 stars)
29. SAM-Med2D: Official implementation of SAM-Med2D (Jupyter Notebook, 114 stars)
30. EgoVideo: [CVPR 2024 Champions] Solutions for the EgoVis Challenges in CVPR 2024 (Jupyter Notebook, 96 stars)
31. DiffRate: [ICCV 2023] An approach to enhance the efficiency of Vision Transformers (ViT) by concurrently employing token pruning and token merging, while incorporating a differentiable compression rate. (Jupyter Notebook, 78 stars)
32. Awesome-DragGAN: A curated list of papers, tutorials, and repositories related to DragGAN (75 stars)
33. MM-NIAH: The official implementation of the paper "Needle In A Multimodal Haystack" (Python, 70 stars)
34. M3I-Pretraining (69 stars)
35. STM-Evaluation (Python, 69 stars)
36. MMT-Bench: [ICML 2024] MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI (Python, 62 stars)
37. MUTR: [AAAI 2024] Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation (Python, 60 stars)
38. ChartAst: ChartAssistant is a chart-based vision-language model for universal chart comprehension and reasoning. (Python, 60 stars)
39. LCL: Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning (Python, 57 stars)
40. DDPS: Official Implementation of "Denoising Diffusion Semantic Segmentation with Mask Prior Modeling" (Python, 53 stars)
41. LORIS: Long-Term Rhythmic Video Soundtracker (ICML 2023) (Python, 52 stars)
42. Awesome-LLM4Tool: A curated list of papers, repositories, tutorials, and anything related to large language models for tools (52 stars)
43. GUI-Odyssey: GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. It consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combos. (Python, 47 stars)
44. PIIP: Parameter-Inverted Image Pyramid Networks (PIIP) (Python, 44 stars)
45. InternVL-MMDetSeg: Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed (Jupyter Notebook, 40 stars)
46. Siamese-Image-Modeling: [CVPR 2023] Implementation of Siamese Image Modeling for Self-Supervised Vision Representation Learning (Python, 32 stars)
47. Multitask-Model-Selector: Implementation of "Foundation Model is Efficient Multimodal Multitask Model Selector" (Python, 27 stars)
48. De-focus-Attention-Networks: Learning 1D Causal Visual Representation with De-focus Attention Networks (Python, 27 stars)
49. Official-ConvMAE-Det (Python, 13 stars)
50. opengvlab.github.io (12 stars)
51. MovieMind (9 stars)
52. perception_test_iccv2023: Champion solutions repository for the Perception Test challenges at the ICCV 2023 workshop (Python, 9 stars)
53. EmbodiedGPT (5 stars)
54. DriveMLM (3 stars)
55. .github (2 stars)