GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?

This work delves into an essential, must-know baseline in light of the latest advancements in Generative Artificial Intelligence (GenAI): the use of GPT-4 for visual understanding. We focus on evaluating GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks. To ensure a comprehensive evaluation, we conducted experiments across three modalities, namely images, videos, and point clouds, spanning a total of 16 popular academic benchmarks.

📣 I also have other cross-modal projects that may interest you ✨.

Revisiting Classifier: Transferring Vision-Language Models for Video Recognition
Wenhao Wu, Zhun Sun, Wanli Ouyang
Conference | Journal | GitHub

Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models
Wenhao Wu, Xiaohan Wang, Haipeng Luo, Jingdong Wang, Yi Yang, Wanli Ouyang
Conference | GitHub

Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?
Wenhao Wu, Haipeng Luo, Bo Fang, Jingdong Wang, Wanli Ouyang
Accepted by CVPR 2023 as 🌟Highlight🌟 | Conference | GitHub

News

  • [Nov 28, 2023] We released our report on arXiv.
  • [Nov 27, 2023] Our prompts have been released. Thanks for your star 😝.

Overview

An overview of the 16 popular benchmark datasets evaluated, spanning images, videos, and point clouds.

Zero-shot visual recognition leveraging GPT-4's linguistic and visual capabilities.

Generated Descriptions from GPT-4

  • We have pre-generated descriptive sentences for all the categories across the datasets, which you can find in the GPT_generated_prompts folder. Enjoy exploring!

  • We've also provided an example script, generate_prompt.py, to help you generate descriptions using GPT-4. Happy coding! For detailed information on all datasets used in our project, please refer to the config folder.

  • Execute the following command to generate descriptions with GPT-4; a minimal sketch of the underlying API call is shown after this block.

    # To run the script for a specific dataset, simply update the following line
    # with the name of the dataset you're working with:
    # dataset_name = ["Dataset Name Here"]   # e.g., dtd
    python generate_prompt.py
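
For a sense of what the script does under the hood, here is a minimal sketch of generating one description per category with the OpenAI Chat Completions API. It assumes the openai v1 Python client with an OPENAI_API_KEY environment variable; the category list below is an illustrative DTD subset, and the actual generate_prompt.py (which presumably reads dataset details from the config folder) may differ.

    # Minimal sketch: generate a one-sentence visual description per category.
    # Assumes `pip install openai` and OPENAI_API_KEY in the environment;
    # the categories below are illustrative, not the repo's config.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    categories = ["banded", "blotchy", "braided"]  # hypothetical DTD subset

    descriptions = {}
    for name in categories:
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{
                "role": "user",
                "content": f"Describe what the '{name}' texture looks like in one sentence.",
            }],
        )
        descriptions[name] = response.choices[0].message.content.strip()
    print(descriptions)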

GPT-4V(ision) for Visual Recognition

  • We share an example script that demonstrates how to use the GPT-4V API for zero-shot predictions on the DTD dataset. Please refer to the GPT4V_ZS.py file for a step-by-step guide; we hope it helps you get started with ease! A minimal sketch of the core API call follows the commands below.

    # GPT-4V zero-shot recognition script.
    # dataset_name = ["Dataset Name Here"]   # e.g., dtd
    python GPT4V_ZS.py

    # We also provide a script that batches multiple samples into each request
    # (larger batch sizes may lead to instability).
    python GPT4V_ZS_batch.py
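
For reference, here is a minimal sketch of the kind of GPT-4V call such a script makes: encode an image, send it alongside the candidate class names, and read back the predicted class. The model name, image path, and class list are illustrative assumptions; GPT4V_ZS.py may structure the prompt and answer parsing differently.

    # Minimal sketch of one GPT-4V zero-shot prediction.
    # Assumes the `openai` v1 client; the file name and classes are illustrative.
    import base64
    from openai import OpenAI

    client = OpenAI()

    with open("example.jpg", "rb") as f:  # hypothetical test image
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    classes = ["banded", "blotchy", "braided"]  # hypothetical DTD subset
    prompt = ("Which one of these texture classes best matches the image: "
              + ", ".join(classes) + "? Answer with the class name only.")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    print(response.choices[0].message.content)  # e.g., "banded"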

Requirements

For guidance on setting up and running the GPT-4 API, we recommend the official OpenAI Quickstart Guide.
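
As a quick sanity check that your environment is ready, the following sketch (assuming the openai v1 Python client and an OPENAI_API_KEY environment variable) simply lists the models your key can access:

    # Sanity check: confirm the OpenAI API is reachable with your key.
    # Assumes `pip install openai` and OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()
    models = client.models.list()
    print("API reachable; models available:", len(models.data))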

📌 BibTeX & Citation

If you use our code in your research or wish to refer to the results, please star 🌟 this repo and use the following BibTeX 📑 entry.

@article{GPT4Vis,
  title={GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?},
  author={Wu, Wenhao and Yao, Huanjin and Zhang, Mengxi and Song, Yuxin and Ouyang, Wanli and Wang, Jingdong},
  journal={arXiv preprint arXiv:2311.15732},
  year={2023}
}

🎗️ Acknowledgement

This evaluation builds on the following excellent works:

  • CLIP: Learning Transferable Visual Models From Natural Language Supervision
  • GPT-4
  • Text4Vis: Transferring Vision-Language Models for Visual Recognition: A Classifier Perspective

We extend our sincere gratitude to these contributors.

👫 Contact

For any questions, please feel free to file an issue.