
Fine-tuned CLIP models are efficient video learners [CVPR 2023]

Hanoona Rasheed*, Muhammad Uzair Khattak*, Muhammad Maaz, Salman Khan, Fahad Shahbaz Khan

*Equally contributing first authors

Website | Paper | Video | Slides

Official implementation of the paper "Fine-tuned CLIP models are efficient video learners".


🚀 News

  • (Feb 28, 2023)
    • Paper accepted at CVPR 2023 🎉
  • (Dec 6, 2022)
    • Training and evaluation code for ViFi-CLIP, along with pretrained models, is released.

Highlights

[Main figure: zero-shot comparison of CLIP variants and t-SNE visualizations of video embeddings]

This work explores the capability of a simple baseline called ViFi-CLIP (Video Fine-tuned CLIP) for adapting image-pretrained CLIP to the video domain. The figure compares the zero-shot performance of vanilla CLIP and several of its variants adapted for videos (trained on Kinetics-400, evaluated on UCF-101 and HMDB-51). The t-SNE visualizations of video embeddings obtained from ViFi-CLIP (4th col.) are compared with embeddings from vanilla CLIP (1st col.), individually tuned CLIP text (2nd col.) and image (3rd col.) encoders on videos, and the recent state-of-the-art work XCLIP (last col.) (∆ represents the difference over XCLIP). The embeddings of ViFi-CLIP are better separated, indicating that simple fine-tuning of CLIP is sufficient to learn suitable video-specific inductive biases and can perform competitively with more complex approaches that have dedicated components designed to model temporal information in videos.

Abstract: Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships, which require meticulous design efforts. Furthermore, when the resulting models are learned on videos, they tend to overfit to the given task distribution and lack generalization. This begs the following question: how to effectively transfer image-level CLIP representations to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that frame-level processing by the CLIP image encoder, followed by feature pooling and similarity matching with the corresponding text embeddings, helps in implicitly modeling temporal cues within ViFi-CLIP. Such fine-tuning helps the model focus on scene dynamics, moving objects and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a 'bridge and prompt' approach that first uses fine-tuning to bridge the domain gap and then learns prompts on the language and vision sides to adapt CLIP representations. We extensively evaluate this simple yet strong baseline on zero-shot, base-to-novel generalization, few-shot and fully supervised settings across five video benchmarks.
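To make the frame-level pipeline above concrete, below is a minimal inference sketch. It assumes the publicly available OpenAI CLIP package and a generic prompt template, both of which are illustrative assumptions; the actual training and inference code lives in this repository.

import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

def classify_video(frames: torch.Tensor, class_names: list) -> torch.Tensor:
    """frames: (T, 3, 224, 224) preprocessed frames of a single video clip."""
    with torch.no_grad():
        # Frame-level processing with the CLIP image encoder.
        frame_feats = model.encode_image(frames.to(device))            # (T, D)
        # Temporal pooling: average frame embeddings into one video embedding.
        video_feat = frame_feats.mean(dim=0, keepdim=True)
        video_feat = video_feat / video_feat.norm(dim=-1, keepdim=True)
        # Text embeddings for candidate action classes (template is illustrative).
        tokens = clip.tokenize([f"a video of a person {c}" for c in class_names]).to(device)
        text_feats = model.encode_text(tokens)                         # (C, D)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        # Similarity matching between the pooled video embedding and the text embeddings.
        logits = 100.0 * video_feat @ text_feats.t()                   # (1, C)
    return logits.softmax(dim=-1).squeeze(0)

During ViFi-CLIP training, the same pooling-and-matching pipeline is used, with both encoders fine-tuned end-to-end on videos.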

Main Contributions

  1. ViFi-CLIP: We formulate and show the significance of an often neglected yet simple baseline for transferring the image-based CLIP model to the video domain. ViFi-CLIP (Video Fine-tuned CLIP) shows that simple fine-tuning of CLIP is sufficient to learn suitable video-specific inductive biases and can perform competitively with more complex approaches that have dedicated components designed to model temporal information in videos.
  2. Base-to-novel generalization benchmark: We introduce a base-to-novel generalization benchmark for the video domain to evaluate the generalization ability of models for video action recognition.
  3. Bridge and Prompt approach: We show the effectiveness of our proposed ‘bridge and prompt’ approach to first bridge the modality gap through fine-tuning followed by prompt learning in both visual and language branches of the CLIP model for low-data regimes.

Model Zoo

NOTE: All models in the experiments below use the publicly available ViT-B/16-based CLIP model. The trained model weights for each experiment are provided in the tables below.

Zero-shot results

All models are trained on Kinetics-400 and then evaluated directly on downstream datasets.

Name (configs) Input HMDB-51 UCF-101 Kinetics-600 Model
CLIP image-FT 32x224 49.0 72.9 62.2 link
CLIP text-FT 32x224 48.5 69.8 68.5 link
ViFi-CLIP 32x224 51.3 76.8 71.2 link

Base-to-novel generalization results

Here, we divide each dataset into base and novel classes. All models are trained on base classes and evaluated on both base and novel classes. Results are averaged over 3 seeds for each experiment.
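The HM column in the tables below is the harmonic mean of base and novel accuracy. A quick sanity check in Python, using the ViFi-CLIP Kinetics-400 row as an example:

def harmonic_mean(base_acc, novel_acc):
    # Harmonic mean of base and novel accuracy, as reported in the HM column.
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)

print(round(harmonic_mean(76.4, 61.1), 1))  # 67.9, matching the ViFi-CLIP Kinetics-400 row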

Kinetics-400

Name (configs) Input Base Acc. Novel Acc. HM Model
CLIP image-FT 32x224 72.9 58.0 64.6 seed1/seed2/seed3
CLIP text-FT 32x224 73.4 59.7 65.8 seed1/seed2/seed3
ViFi-CLIP 32x224 76.4 61.1 67.9 seed1/seed2/seed3

HMDB-51

Name (configs) Input Base Acc. Novel Acc. HM Model
CLIP image-FT 32x224 62.6 47.5 54.0 seed1/seed2/seed3
CLIP text-FT 32x224 70.0 51.2 59.1 seed1/seed2/seed3
ViFi-CLIP 32x224 73.8 53.3 61.9 seed1/seed2/seed3

UCF-101

Name (configs) Input Base Acc. Novel Acc. HM Model
CLIP image-FT 32x224 86.4 65.3 74.4 seed1/seed2/seed3
CLIP text-FT 32x224 90.9 67.4 77.4 seed1/seed2/seed3
ViFi-CLIP 32x224 92.9 67.7 78.3 seed1/seed2/seed3

SSv2

Name (configs) Input Base Acc. Novel Acc. HM Model
CLIP image-FT 32x224 9.2 8.5 8.8 seed1/seed2/seed3
CLIP text-FT 32x224 12.4 9.5 10.8 seed1/seed2/seed3
ViFi-CLIP 32x224 16.2 12.1 13.9 seed1/seed2/seed3

VL Prompting approach: Base-to-Novel

ViFi-CLIP is first trained on K400 and then vision and language prompts are further fine-tuned on the downstream datasets.
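A minimal sketch of this second stage follows, assuming the K400 fine-tuned backbone is kept frozen while only the learnable prompt tokens are optimized. The VLPrompts container, prompt length and dimensions below are illustrative; the actual prompt design is specified by the configs and trainers in this repository.

import torch
import torch.nn as nn

class VLPrompts(nn.Module):
    # Illustrative container for learnable vision and language prompt tokens.
    def __init__(self, n_ctx=16, txt_dim=512, vis_dim=768):
        super().__init__()
        self.text_prompts = nn.Parameter(0.02 * torch.randn(n_ctx, txt_dim))
        self.vision_prompts = nn.Parameter(0.02 * torch.randn(n_ctx, vis_dim))

# Stage 1 ("bridge"): the CLIP backbone has already been fully fine-tuned on K400.
# Stage 2 ("prompt"): freeze the backbone and optimize only the prompt tokens.
backbone = nn.Identity()  # placeholder for the K400 fine-tuned ViFi-CLIP model
for p in backbone.parameters():
    p.requires_grad_(False)

prompts = VLPrompts()
optimizer = torch.optim.AdamW(prompts.parameters(), lr=1e-3)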

Dataset (configs) Input Base Acc. Novel Acc. HM Model
HMDB-51 32x224 77.1 54.9 64.1 seed1/seed2/seed3
UCF-101 32x224 95.9 74.1 83.6 seed1/seed2/seed3
SSv2 32x224 15.8 11.5 13.3 seed1/seed2/seed3

Few-shot results

The table below shows few-shot results of ViFi-CLIP for K = 2, 4, 8 and 16 shots.

Name (configs) Dataset K (shots) Input Top-1 Acc. Model
ViFi-CLIP HMDB-51 2 32x224 57.2 link
ViFi-CLIP HMDB-51 4 32x224 62.7 link
ViFi-CLIP HMDB-51 8 32x224 64.5 link
ViFi-CLIP HMDB-51 16 32x224 66.8 link
ViFi-CLIP UCF-101 2 32x224 80.7 link
ViFi-CLIP UCF-101 4 32x224 85.1 link
ViFi-CLIP UCF-101 8 32x224 90.0 link
ViFi-CLIP UCF-101 16 32x224 92.7 link
ViFi-CLIP SSv2 2 32x224 6.2 link
ViFi-CLIP SSv2 4 32x224 7.4 link
ViFi-CLIP SSv2 8 32x224 8.5 link
ViFi-CLIP SSv2 16 32x224 12.4 link

NOTE: Few-shot results for other CLIP Fine-tuned variants are presented in our main paper (Table 3). Model weights for other variants are provided here.

VL Prompting approach: Few-shot

ViFi-CLIP is first trained on K400, and then vision and language prompts are further fine-tuned on the downstream datasets in a few-shot manner.

Dataset (configs) Input K=2 K=4 K=8 K=16 Model
HMDB-51 32x224 63.0 65.1 69.6 72.0 K=2/K=4/K=8/K=16
UCF-101 32x224 91.0 93.7 95.0 96.4 K=2/K=4/K=8/K=16
SSv2 32x224 6.7 7.9 10.2 13.5 K=2/K=4/K=8/K=16

Fully-supervised results on Kinetics-400

Name (configs) FLOPS(G) Input Top-1 Acc. Top-5 Acc. Model
CLIP image-FT 281 16x224 82.8 96.2 link
CLIP text-FT 281 16x224 73.1 91.2 link
ViFi-CLIP 281 16x224 83.9 96.3 link

Installation

For installation and other package requirements, please follow the instructions detailed in INSTALL.md.

Data preparation

Please follow the instructions at DATASETS.md to prepare all datasets.

Training

For all experiments shown in the tables above, we provide config files in the configs folder. For example, to train ViFi-CLIP (which tunes both the image and text encoders) on Kinetics-400, run the following command:

python -m torch.distributed.launch --nproc_per_node=8 \
main.py -cfg configs/fully_supervised/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT

Note:

  • We recommend keeping the total batch size as mentioned in the respective config files. Please use --accumulation-steps to maintain the total batch size (see the sketch after this list). Specifically, here the effective total batch size is 8 (GPUs_NUM) x 4 (TRAIN.BATCH_SIZE) x 16 (TRAIN.ACCUMULATION_STEPS) = 512.
  • After setting up the datasets as instructed in DATASETS.md, the only argument in the config file that needs to be specified is the data path. All other settings in the config files are pre-set.
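The sketch below illustrates the gradient-accumulation idea behind --accumulation-steps with a toy model and toy data; the real logic lives in this repository's training loop.

import torch
import torch.nn as nn

ACCUMULATION_STEPS = 16                       # corresponds to TRAIN.ACCUMULATION_STEPS
model = nn.Linear(8, 4)                       # toy stand-in for the video model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(4, 8), torch.randint(0, 4, (4,))) for _ in range(32)]  # micro-batches of size 4

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    # Scale the loss so the accumulated gradient matches the average over the large batch.
    loss = nn.functional.cross_entropy(model(x), y) / ACCUMULATION_STEPS
    loss.backward()
    if (step + 1) % ACCUMULATION_STEPS == 0:
        optimizer.step()                      # one weight update per 16 micro-batches
        optimizer.zero_grad()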

For detailed training instructions for all experimental setups, please refer to TRAIN.md.

Evaluating models

To evaluate a model, please use a suitable config and corresponding model weights. For example, to evaluate ViFi-CLIP with 16 frames on Kinetics-400, run the command below:

python -m torch.distributed.launch --nproc_per_node=8 main.py \
-cfg configs/fully_supervised/k400/16_16_vifi_clip.yaml --output /PATH/TO/OUTPUT \
--only_test --resume /PATH/TO/CKPT --opts TEST.NUM_CLIP 4 TEST.NUM_CROP 3
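TEST.NUM_CLIP 4 and TEST.NUM_CROP 3 enable multi-view testing: each video is evaluated as 4 temporal clips x 3 spatial crops, and the per-view predictions are aggregated. A minimal sketch of that aggregation step is shown below (averaging softmax scores here is illustrative; see the evaluation code in this repository for the exact procedure).

import torch

def aggregate_views(view_logits):
    # view_logits: (num_clips * num_crops, num_classes) logits for one video.
    # Average the per-view class probabilities into a single prediction.
    return view_logits.softmax(dim=-1).mean(dim=0)

logits = torch.randn(4 * 3, 400)              # 12 views over 400 Kinetics classes
prediction = aggregate_views(logits).argmax().item()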

Contact

If you have any questions, please create an issue on this repository or contact us at [email protected] or [email protected].

Citation

If you use our approach (code, model or dataset splits) in your research, please consider citing:

@inproceedings{hanoonavificlip,
    title={Finetuned CLIP models are efficient video learners},
    author={Rasheed, Hanoona and khattak, Muhammad Uzair and Maaz, Muhammad and Khan, Salman and Khan, Fahad Shahbaz},
    booktitle={The IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year={2023}
}

Acknowledgements

Our code is based on XCLIP's repository. We sincerely thank the authors for releasing their code. If you use our model and code, please consider citing XCLIP as well.
