  • Stars: 3,198
  • Rank: 14,035 (Top 0.3%)
  • Language: Python
  • License: Apache License 2.0
  • Created over 1 year ago
  • Updated 3 months ago

Repository Details

InternGPT (iGPT) is an open-source demo platform where you can easily showcase your AI models. It currently supports DragGAN, ChatGPT, ImageBind, multimodal chat like GPT-4, SAM, interactive image editing, and more. Try it at igpt.opengvlab.com (an online demo system supporting DragGAN, ChatGPT, ImageBind, and SAM).

[Chinese documentation]

The project is still under construction. We will continue to update it, and we welcome contributions and pull requests from the community.

🤖💬 InternGPT [Paper]

InternGPT (iGPT for short) / InternChat (iChat for short) is a pointing-language-driven visual interactive system that lets you interact with ChatGPT by clicking, dragging, and drawing with a pointing device. The name InternGPT stands for interaction, nonverbal, and ChatGPT. Unlike existing interactive systems that rely on language alone, iGPT incorporates pointing instructions, which significantly improves the efficiency of communication between users and chatbots as well as the accuracy of chatbots on vision-centric tasks, especially in complicated visual scenarios. In addition, iGPT uses an auxiliary control mechanism to improve the control capability of the LLM, and a large vision-language model named Husky is fine-tuned for high-quality multi-modal dialogue (impressing ChatGPT-3.5-turbo with 93.89% GPT-4 quality).

🤖💬 Online Demo

InternGPT is online (see https://igpt.opengvlab.com). Let's try it!

[NOTE] You may have to wait in a lengthy queue. Alternatively, you can clone our repo and run it on your own GPU.

Video Demo with DragGAN:

dragGAN_demo2.mp4

Video Demo with ImageBind:

video_demo_with_imagebind.mp4

iGPT Video Demo:

online_demo.mp4

🥳 🚀 What's New

  • (2023.06.19) We optimized GPU memory usage when executing the tools. Please refer to Get Started.

  • (2023.06.19) We updated INSTALL.md, which now provides more detailed instructions for setting up the environment.

  • (2023.05.31) We regret that, for urgent reasons, we have had to suspend the online demo. To experience all the features, please deploy iGPT locally.

  • (2023.05.24) 🎉🎉🎉 We now support DragGAN! Please see the video demo for usage. Try this awesome feature: Demo. (We now support a fully functional DragGAN! You can drag points and use custom images. See the video demo for usage; the reproduced DragGAN code is here, and the online demo is here.)

  • (2023.05.18) We now support ImageBind. Please see the video demo for usage.

  • (2023.05.15) The model_zoo, including HuskyVQA, has been released! Try it on your local machine!

  • (2023.05.15) Our code is also publicly available on Hugging Face! You can duplicate the repository and run it on your own GPUs.

🧭 User Manual

Update:

(2023.05.24) We now support DragGAN. You can try it as follows:

  • Click the button New Image;
  • Click on the image to place points: blue denotes a start point and red denotes an end point;
  • Make sure the number of blue points equals the number of red points, then click the button Drag It;
  • After processing, you will receive an edited image and a video that visualizes the editing process.
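
The pairing rule above (every blue start point needs a matching red end point) can be sketched as a small check. This is only an illustration, not iGPT's actual code: the function name `pair_drag_points`, the `(x, y)` tuple representation, and the assumption that points are paired in click order are all hypothetical.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # hypothetical (x, y) pixel coordinate


def pair_drag_points(starts: List[Point], ends: List[Point]) -> List[Tuple[Point, Point]]:
    """Pair each blue start point with the red end point at the same index.

    Assumes (for this sketch only) that points are matched in click order.
    """
    if len(starts) != len(ends):
        raise ValueError(
            f"{len(starts)} start point(s) but {len(ends)} end point(s); "
            "the counts must match before dragging"
        )
    return list(zip(starts, ends))


if __name__ == "__main__":
    pairs = pair_drag_points([(10, 20), (30, 40)], [(15, 25), (35, 45)])
    for start, end in pairs:
        print(f"drag {start} -> {end}")
```

If the counts differ, the sketch raises an error instead of silently dropping the unmatched point, which mirrors the manual's instruction to balance the points before clicking Drag It.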

(2023.05.18) We now support ImageBind. To generate a new image conditioned on audio, first upload an audio file:

  • To generate a new image from a single audio file, send a message like: "generate a real image from this audio";
  • To generate a new image from audio and text, send a message like: "generate a real image from this audio and {your prompt}";
  • To generate a new image from audio and an image, upload an image and then send a message like: "generate a new image from above image and audio".

Main features:

After uploading an image, you can have a multi-modal dialogue by sending messages like: "what is it in the image?" or "what is the background color of image?".
You can also interactively operate on, edit, or generate images as follows:

  • Click the image and press the button Pick to visualize the segmented region, or press the button OCR to recognize the words at the chosen position;
  • To remove the masked region in the image, send a message like: "remove the masked region";
  • To replace the masked region in the image, send a message like: "replace the masked region with {your prompt}";
  • To generate a new image, send a message like: "generate a new image based on its segmentation describing {your prompt}";
  • To create a new image from your scribble, press the button Whiteboard and draw on the board. After drawing, press the button Save and send a message like: "generate a new image based on this scribble describing {your prompt}".
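
The editing messages listed above follow a few fixed patterns with a `{your prompt}` slot. As a rough sketch, they could be assembled as follows. The message strings are taken from the manual; the dictionary keys and the `build_message` helper are hypothetical names used only for this example and are not part of iGPT.

```python
# Message templates copied from the user manual; the keys are hypothetical
# labels chosen for this sketch.
TEMPLATES = {
    "remove": "remove the masked region",
    "replace": "replace the masked region with {prompt}",
    "from_segmentation": "generate a new image based on its segmentation describing {prompt}",
    "from_scribble": "generate a new image based on this scribble describing {prompt}",
}


def build_message(task: str, prompt: str = "") -> str:
    """Fill the template for `task`, requiring a prompt where the template has a slot."""
    template = TEMPLATES[task]
    prompt = prompt.strip()
    if "{prompt}" in template and not prompt:
        raise ValueError(f"task '{task}' needs a non-empty prompt")
    return template.format(prompt=prompt)


if __name__ == "__main__":
    print(build_message("remove"))
    print(build_message("replace", "a red sports car"))
```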

🗓️ Schedule

  • Support VisionLLM
  • Support Chinese
  • Support MOSS
  • More powerful foundation models based on InternImage and InternVideo
  • More accurate interactive experience
  • OpenMMLab toolkit
  • Web page & code generation
  • Support search engine
  • Low cost deployment
  • Support DragGAN
  • Support ImageBind
  • Response verification for agent
  • Prompt optimization
  • User manual and video demo
  • Support voice assistant
  • Support click interaction
  • Interactive image editing
  • Interactive image generation
  • Interactive visual question answering
  • Segment anything
  • Image inpainting
  • Image caption
  • Image matting
  • Optical character recognition
  • Action recognition
  • Video caption
  • Video dense caption
  • Video highlight interpretation

🏠 System Overview

arch

🎁 Major Features

Remove the masked object

Interactive image editing

Image generation

Interactive visual question answer

Interactive image generation

Video highlight interpretation

🛠️ Installation

See INSTALL.md

👨‍🏫 Get Started

Run the following shell command to start a Gradio service for our basic features:

python -u app.py --load "HuskyVQA_cuda:0,SegmentAnything_cuda:0,ImageOCRRecognition_cuda:0" --port 3456 -e

If you want to enable the voice assistant, use openssl to generate a certificate:

mkdir certificate
openssl req -x509 -newkey rsa:4096 -keyout certificate/key.pem -out certificate/cert.pem -sha256 -days 365 -nodes

and then run:

python -u app.py --load "HuskyVQA_cuda:0,SegmentAnything_cuda:0,ImageOCRRecognition_cuda:0" \
--port 3456 --https -e
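
Before starting the service with --https, it can help to confirm that the openssl step produced both files. The sketch below only checks the two paths written by the openssl command above (certificate/key.pem and certificate/cert.pem). Whether app.py reads exactly these paths is an assumption, and `missing_certificate_files` is a hypothetical helper.

```python
from pathlib import Path
from typing import List


def missing_certificate_files(cert_dir: str = "certificate") -> List[str]:
    """Return the names of expected certificate files that are not present."""
    expected = ("key.pem", "cert.pem")  # file names used in the openssl command
    base = Path(cert_dir)
    return [name for name in expected if not (base / name).is_file()]


if __name__ == "__main__":
    missing = missing_certificate_files()
    if missing:
        print("Missing certificate files:", ", ".join(missing))
    else:
        print("Certificate files found; --https should be usable.")
```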

To enable all features of iGPT, run:

python -u app.py \
--load "ImageOCRRecognition_cuda:0,Text2Image_cuda:0,SegmentAnything_cuda:0,ActionRecognition_cuda:0,VideoCaption_cuda:0,DenseCaption_cuda:0,ReplaceMaskedAnything_cuda:0,LDMInpainting_cuda:0,SegText2Image_cuda:0,ScribbleText2Image_cuda:0,Image2Scribble_cuda:0,Image2Canny_cuda:0,CannyText2Image_cuda:0,StyleGAN_cuda:0,Anything2Image_cuda:0,HuskyVQA_cuda:0" \
-p 3456 --https -e

Note that the -e flag can save a lot of memory.
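
Judging from the commands above, the --load value looks like a comma-separated list of entries of the form ToolName_device (for example HuskyVQA_cuda:0). The sketch below parses such a string to show which tool goes to which device. The format is inferred from the examples rather than taken from app.py, and `parse_load_spec` is a hypothetical helper.

```python
from typing import Dict


def parse_load_spec(spec: str) -> Dict[str, str]:
    """Split 'ToolA_cuda:0,ToolB_cuda:1' into {'ToolA': 'cuda:0', 'ToolB': 'cuda:1'}.

    Assumes the device is everything after the last underscore of each entry.
    """
    tools: Dict[str, str] = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue  # tolerate trailing commas
        name, sep, device = entry.rpartition("_")
        if not sep or not name or not device:
            raise ValueError(f"cannot parse entry '{entry}' as ToolName_device")
        tools[name] = device
    return tools


if __name__ == "__main__":
    spec = "HuskyVQA_cuda:0,SegmentAnything_cuda:0,ImageOCRRecognition_cuda:0"
    for tool, device in parse_load_spec(spec).items():
        print(f"{tool:>22} -> {device}")
```

Listing the mapping this way makes it easier to spot when every tool is placed on the same GPU, as in the examples above.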

Selectively Loading Features

If you only want to try DragGAN, you only need to load StyleGAN and open the "DragGAN" tab:

python -u app.py --load "StyleGAN_cuda:0" --tab "DragGAN" --port 3456 --https -e

In this case, only the DragGAN functions are available, which frees you from installing dependencies you are not interested in.
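
Selective loading can also be scripted. The following sketch assembles a launch command from a tool-to-device mapping, using only the flags that appear in this section (--load, --tab, --port, --https, -e). The helper `build_launch_command` and its parameter names are hypothetical, and the sketch only builds the argument list without starting anything.

```python
import shlex
from typing import Dict, List, Optional


def build_launch_command(
    tools: Dict[str, str],
    port: int = 3456,
    tab: Optional[str] = None,
    https: bool = False,
    save_memory: bool = True,
) -> List[str]:
    """Build the argv list for app.py from a {tool_name: device} mapping."""
    if not tools:
        raise ValueError("at least one tool must be loaded")
    load = ",".join(f"{name}_{device}" for name, device in tools.items())
    cmd = ["python", "-u", "app.py", "--load", load]
    if tab:
        cmd += ["--tab", tab]
    cmd += ["--port", str(port)]
    if https:
        cmd.append("--https")
    if save_memory:
        cmd.append("-e")  # the README notes that -e can save a lot of memory
    return cmd


if __name__ == "__main__":
    cmd = build_launch_command({"StyleGAN": "cuda:0"}, tab="DragGAN", https=True)
    print(shlex.join(cmd))  # pass `cmd` to subprocess.run(...) to actually launch
```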

🎫 License

This project is released under the Apache 2.0 license.

🖊️ Citation

If you find this project useful in your research, please consider citing:

@article{2023interngpt,
  title={InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language},
  author={Liu, Zhaoyang and He, Yinan and Wang, Wenhai and Wang, Weiyun and Wang, Yi and Chen, Shoufa and Zhang, Qinglong and Lai, Zeqiang and Yang, Yang and Li, Qingyun and Yu, Jiashuo and others},
  journal={arXiv preprint arXiv:2305.05662},
  year={2023}
}

🤝 Acknowledgement

Thanks to the following open-source projects:

Hugging Face, LangChain, TaskMatrix, SAM, Stable Diffusion, ControlNet, InstructPix2Pix, BLIP, Latent Diffusion Models, EasyOCR, ImageBind, DragGAN

You are welcome to join the discussion and help us continuously improve the user experience of InternGPT.

WeChat QR Code:

More Repositories

1. InternVL (Python, 5,753 stars): [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model approaching GPT-4o performance.
2. LLaMA-Adapter (Python, 5,717 stars): [ICLR 2024] Fine-tuning LLaMA to follow Instructions within 1 Hour and 1.2M Parameters
3. DragGAN (Python, 4,996 stars): Unofficial Implementation of DragGAN - "Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold" (full-featured DragGAN implementation with an online demo and local deployment; code and models are fully open source; supports Windows, macOS, Linux)
4. Ask-Anything (Python, 2,984 stars): [CVPR2024 Highlight][VideoChatGPT] ChatGPT with video understanding! And many more supported LMs such as miniGPT4, StableLM, and MOSS.
5. InternImage (Python, 2,502 stars): [CVPR 2023 Highlight] InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions
6. InternVideo (Python, 1,392 stars): [ECCV2024] Video Foundation Models & Data for Multimodal Understanding
7. VisionLLM (Python, 874 stars): VisionLLM Series
8. VideoMamba (Python, 787 stars): [ECCV2024] VideoMamba: State Space Model for Efficient Video Understanding
9. OmniQuant (Python, 691 stars): [ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
10. VideoMAEv2 (Python, 486 stars): [CVPR 2023] VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking
11. DCNv4 (Python, 463 stars): [CVPR 2024] Deformable Convolution v4
12. all-seeing (Python, 452 stars): [ICLR 2024 & ECCV 2024] The All-Seeing Projects: Towards Panoptic Visual Recognition & Understanding and General Relation Comprehension of the Open World
13. GITM (445 stars): Ghost in the Minecraft: Generally Capable Agents for Open-World Environments via Large Language Models with Text-based Knowledge and Memory
14. Multi-Modality-Arena (Python, 428 stars): Chatbot Arena meets multi-modality! Multi-Modality Arena allows you to benchmark vision-language models side-by-side while providing images as inputs. Supports MiniGPT-4, LLaMA-Adapter V2, LLaVA, BLIP-2, and many more!
15. Vision-RWKV (Python, 352 stars): Vision-RWKV: Efficient and Scalable Visual Perception with RWKV-Like Architectures
16. CaFo (Python, 344 stars): [CVPR 2023] Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners
17. PonderV2 (Python, 311 stars): PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm
18. LAMM (Python, 296 stars): [NeurIPS 2023 Datasets and Benchmarks Track] LAMM: Multi-Modal Large Language Models and Applications as AI Agents
19. UniFormerV2 (Python, 280 stars): [ICCV2023] UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer
20. unmasked_teacher (Python, 276 stars): [ICCV2023 Oral] Unmasked Teacher: Towards Training-Efficient Video Foundation Models
21. OmniCorpus (Python, 259 stars): OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
22. HumanBench (Python, 231 stars): This repo is the official implementation of HumanBench (CVPR2023)
23. Instruct2Act (Python, 223 stars): Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model
24. EfficientQAT (Python, 198 stars): EfficientQAT: Efficient Quantization-Aware Training for Large Language Models
25. gv-benchmark (Python, 189 stars): General Vision Benchmark, GV-B, a project from OpenGVLab
26. ControlLLM (Python, 181 stars): ControlLLM: Augment Language Models with Tools by Searching on Graphs
27. InternVideo2 (152 stars)
28. UniHCP (Python, 149 stars): Official PyTorch implementation of UniHCP
29. efficient-video-recognition (Python, 114 stars)
30. SAM-Med2D (Jupyter Notebook, 114 stars): Official implementation of SAM-Med2D
31. EgoVideo (Jupyter Notebook, 103 stars): [CVPR 2024 Champions] Solutions for EgoVis Challenges in CVPR 2024
32. DiffRate (Jupyter Notebook, 86 stars): [ICCV 23] An approach to enhance the efficiency of Vision Transformer (ViT) by concurrently employing token pruning and token merging techniques, while incorporating a differentiable compression rate.
33. MMT-Bench (Python, 85 stars): ICML'2024 | MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI
34. Awesome-DragGAN (75 stars): Awesome-DragGAN: A curated list of papers, tutorials, repositories related to DragGAN
35. MM-NIAH (Python, 70 stars): This is the official implementation of the paper "Needle In A Multimodal Haystack"
36. M3I-Pretraining (69 stars)
37. STM-Evaluation (Python, 69 stars)
38. MUTR (Python, 65 stars): [AAAI 2024] Referred by Multi-Modality: A Unified Temporal Transformers for Video Object Segmentation
39. LCL (Python, 63 stars): Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
40. ChartAst (Python, 60 stars): ChartAssistant is a chart-based vision-language model for universal chart comprehension and reasoning.
41. LORIS (Python, 54 stars): Long-Term Rhythmic Video Soundtracker, ICML2023
42. DDPS (Python, 53 stars): Official Implementation of "Denoising Diffusion Semantic Segmentation with Mask Prior Modeling"
43. Awesome-LLM4Tool (52 stars): A curated list of papers, repositories, tutorials, and anything else related to large language models for tools
44. PIIP (Python, 51 stars): NeurIPS 2024 Spotlight ⭐️ Parameter-Inverted Image Pyramid Networks (PIIP)
45. InternVL-MMDetSeg (Jupyter Notebook, 50 stars): Train InternViT-6B in MMSegmentation and MMDetection with DeepSpeed
46. GUI-Odyssey (Python, 47 stars): GUI Odyssey is a comprehensive dataset for training and evaluating cross-app navigation agents. GUI Odyssey consists of 7,735 episodes from 6 mobile devices, spanning 6 types of cross-app tasks, 201 apps, and 1.4K app combos.
47. Siamese-Image-Modeling (Python, 33 stars): [CVPR 2023] Implementation of Siamese Image Modeling for Self-Supervised Vision Representation Learning
48. De-focus-Attention-Networks (Python, 28 stars): Learning 1D Causal Visual Representation with De-focus Attention Networks
49. Multitask-Model-Selector (Python, 27 stars): Implementation of Foundation Model is Efficient Multimodal Multitask Model Selector
50. Official-ConvMAE-Det (Python, 13 stars)
51. perception_test_iccv2023 (Python, 13 stars): Champion Solutions repository for Perception Test challenges in ICCV2023 workshop.
52. opengvlab.github.io (12 stars)
53. MovieMind (9 stars)
54. EmbodiedGPT (5 stars)
55. DriveMLM (3 stars)
56. .github (2 stars)