  • Stars: 781
  • Rank: 58,232 (Top 2 %)
  • Language: Python
  • License: Apache License 2.0
  • Created: over 1 year ago
  • Updated: over 1 year ago

Repository Details

[A toolbox for fun.] Transform Image into Unique Paragraph with ChatGPT, BLIP2, OFA, GRIT, Segment Anything, ControlNet.

Image.txt: Transform Image Into Unique Paragraph

Hugging Face Spaces, Open In Colab

Project Website

(The Hugging Face demo sometimes does not work in Safari; use Chrome instead.)

Demo

News

  • 17/April/2023. In addition to Semantic Segment Anything, we use Edit Anything to get region-level semantics. All models now take less than 20 s on an 8 GB GPU (10 times faster than the previous version on CPU).
  • 17/April/2023. Our project is online on Hugging Face. Give it a try!
  • 14/April/2023. Our project is very popular on Twitter. See the posted tweets for details.

(Runs on an 8 GB GPU in under 20 s!)

Main Pipeline

[Figure: main pipeline]

Reasoning Details

[Figure: reasoning details]

To Do List

Done

  • GRIT example.
  • ControlNet, BLIP2.
  • Semantic Segment Anything.
  • Segment Anything for fine-grained semantic.
  • Gradio.
  • Integrate GRIT into our code.
  • Support GPT4 API.
  • Notebook/Huggingface Space.
  • Region Semantic Classification from Edit-Anything.
  • Make the model lightweight.

Doing

  • Replace ChatGPT with own trained LLM.
  • Use other grounded text-to-image models instead of Canny ControlNet.
  • Show retrieval result in gradio.

Visualization

The text-to-image model is ControlNet with Canny edge conditioning, from diffusers.

[Figures: visualization examples]

1. Installation

Please find installation instructions in install.md.

2. Start

Simple visualization

export OPENAI_KEY=[YOUR KEY HERE]
python main.py  --image_src [image_path] --out_image_name [out_file_name]

If your GPU has less than 8 GB of memory:

python main.py --image_caption_device cpu --semantic_segment_device cpu

If you have no GPU available:

python main.py --image_caption_device cpu --semantic_segment_device cpu --dense_caption_device cpu  --contolnet_device cpu

For example:

python main.py --image_src "examples/3.jpg" --out_image_name "output/3_result.jpg"
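The example command pairs an input such as "examples/3.jpg" with an output named "output/3_result.jpg". As a minimal sketch, assuming you want the same naming pattern for other images, the helper below builds the argument list for one image. The function name build_command and the "_result.jpg" suffix rule are illustrative assumptions, not part of the project; only the --image_src and --out_image_name flags come from the command above.

```python
import shlex
from pathlib import Path


def build_command(image_path, out_dir="output"):
    """Build the main.py argument list for one image (hypothetical helper)."""
    image_path = Path(image_path)
    # Mirror the README example: examples/3.jpg -> output/3_result.jpg
    out_name = Path(out_dir) / f"{image_path.stem}_result.jpg"
    return [
        "python", "main.py",
        "--image_src", image_path.as_posix(),
        "--out_image_name", out_name.as_posix(),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be inspected first.
    print(shlex.join(build_command("examples/3.jpg")))
```

The list can be passed to subprocess.run once OPENAI_KEY is exported in the environment.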

Note: If your GPU has more than 15 GB of memory, set all devices to GPU for faster inference.
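The memory guidance above can be summarized as a small lookup. This is a sketch only: the function device_flags is hypothetical, the value "cuda" for the GPU setting is an assumption, and the case between 8 GB and 15 GB simply leaves the script defaults untouched because the text gives no instruction for it. The flag names, including the spelling --contolnet_device, are copied from the commands above.

```python
def device_flags(gpu_memory_gb):
    """Map GPU memory in GB (None means no GPU) to main.py device flags."""
    all_flags = [
        "--image_caption_device",
        "--semantic_segment_device",
        "--dense_caption_device",
        "--contolnet_device",  # spelled as in the README command
    ]
    if gpu_memory_gb is None:
        # No GPU available: run every model on the CPU.
        chosen = {flag: "cpu" for flag in all_flags}
    elif gpu_memory_gb < 8:
        # Small GPU: move captioning and semantic segmentation to the CPU.
        chosen = {
            "--image_caption_device": "cpu",
            "--semantic_segment_device": "cpu",
        }
    elif gpu_memory_gb > 15:
        # Large GPU: put everything on the GPU ("cuda" is an assumed value).
        chosen = {flag: "cuda" for flag in all_flags}
    else:
        # 8 to 15 GB: keep the script defaults.
        chosen = {}
    args = []
    for flag, device in chosen.items():
        args += [flag, device]
    return args


if __name__ == "__main__":
    print(device_flags(6))
```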

The generated text and image can be found in "output/".

Note: Use GPT-4 for better results, as GPT-3.5 sometimes misses position information.

Use gradio directly

python main_gradio.py

If your GPU has more than 20 GB of memory, use device='cuda' as the default.

3. Visualization

Input: [image]
BLIP2 Image Caption: A dog sitting on a porch with a bike.
GRIT Dense Caption: [image]
Semantic Segment Anything: [image]

The final generated paragraph with ChatGPT is:

  This image depicts a black and white dog sitting on a porch beside a red bike. The dense caption mentions other objects in the scene, such as a white car parked on the street and a red bike parked on the side of the road. The region semantic provides more specific information, including the porch, floor, wall, and trees. The dog can be seen sitting on the floor beside the bike, and there is also a parked bicycle and tree in the background. The wall is visible on one side of the image, while the street and trees can be seen in the other direction. 
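The paragraph above draws on three sources: the image caption, the dense captions, and the region semantics. One way to picture the hand-off to the language model is a prompt that lists the three. The sketch below is an assumption for illustration only: the function build_prompt and the prompt wording are hypothetical and are not the project's actual prompt. The example values are taken from the caption and paragraph above.

```python
def build_prompt(caption, dense_captions, region_semantics):
    """Combine the three visual descriptions into one text prompt (illustrative)."""
    lines = [
        "Describe the image in one coherent paragraph using only the facts below.",
        f"Image caption: {caption}",
        "Dense captions: " + "; ".join(dense_captions),
        "Region semantics: " + ", ".join(region_semantics),
    ]
    return "\n".join(lines)


example_prompt = build_prompt(
    caption="A dog sitting on a porch with a bike.",
    dense_captions=[
        "a white car parked on the street",
        "a red bike parked on the side of the road",
    ],
    region_semantics=["porch", "floor", "wall", "trees"],
)

if __name__ == "__main__":
    print(example_prompt)
```

Keeping the three sources on separate labelled lines makes it easy to see which model contributed each detail in the generated paragraph.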

4. Retrieval Result on COCO

Method                   | Trainable Parameters | Running Time | IR@1 | TR@1
Image-text               | 230M                 | 9 h          | 43.8 | 33.2
Generated Paragraph-text | 0                    | 5 min        | 49.7 | 36.1

Interestingly, we find that after compressing an image into a paragraph, the retrieval result is even better than when using the source image.
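For readers unfamiliar with the metric columns, the sketch below assumes that IR@1 and TR@1 denote Recall@1 for image retrieval (text query, image candidates) and text retrieval (image query, text candidates). The similarity values are made-up toy numbers, not results from this project, and the one-to-one pairing of texts and images is a simplification (COCO has several captions per image).

```python
def recall_at_1(similarity):
    """Percentage of queries whose top-scoring candidate is the matching one.

    similarity[i][j] is the score between query i and candidate j;
    the correct candidate for query i is assumed to be index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        best = max(range(len(row)), key=row.__getitem__)
        hits += int(best == i)
    return 100.0 * hits / len(similarity)


def transpose(matrix):
    """Swap queries and candidates."""
    return [list(column) for column in zip(*matrix)]


# Toy scores: rows are texts, columns are images.
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.4],
    [0.7, 0.6, 0.5],
]

ir_at_1 = recall_at_1(toy_similarity)             # text -> image
tr_at_1 = recall_at_1(transpose(toy_similarity))  # image -> text

if __name__ == "__main__":
    print(f"IR@1 = {ir_at_1:.1f}, TR@1 = {tr_at_1:.1f}")
```

In the toy matrix, the third text ranks the first image highest, so image retrieval scores 2 out of 3, while every image ranks its own text highest.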

Others

If you have suggestions or need additional functions implemented in this codebase, feel free to email me at awinyimg dot gmail dot com or open an issue.

Acknowledgment

This work is based on ChatGPT, Edit_Anything, BLIP2, GRIT, OFA, Segment-Anything, Semantic-Segment-Anything, and ControlNet.
