• Stars
    star
    3,560
  • Rank 12,458 (Top 0.3 %)
  • Language
    Python
  • License
    MIT License
  • Created over 1 year ago
  • Updated 9 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

🦦 Otter, a multi-modal model based on OpenFlamingo (open-sourced version of DeepMind's Flamingo), trained on MIMIC-IT and showcasing improved instruction-following and in-context learning ability.

1S-Lab, Nanyang Technological University  2Microsoft Research, Redmond
Co-Project Lead  * Equal Contribution  Corresponding Author

Hits

Project Page | Otter Paper | MIMIC-IT Paper | MIMIC-IT Dataset

Video Demo: Otter's Conceptual Demo Video | Bilibili 哔哩哔哩

Interactive Demo: Otter-Image | Otter-Video

Our models would be sometimes offline due to GPU limitation (if we need to train new models lol). You can refer to 🏎️ Run Otter Locally to try Otter-Image and Otter-Video more smoothly on your local machine, with at least 16G GPU mem (BF16/FP16 Mode) to help your tasks like image/video tagging, captioning or identifying harmful content.

Corresponding Checkpoints: luodian/OTTER-Image-MPT7B | luodian/OTTER-Video-LLaMA7B-DenseCaption

For who in the mainland China: Open in OpenXLab | Open in OpenXLab)

Otter-Image supports multiple images input as in-context examples, which is the first multi-modal instruction tuned model that supports to organize inputs this way.

Otter-Video supports videos inputs (frames are arranged as original Flamingo's implementation) and multiple images inputs (they serve as in-context examples for each other).

Eval Results: Multi-Modal Arena | MLLM Evaluation Benchmark | OpenCompass-MMBench

🦾 Update

Contact: Leave issue or [email protected]/[email protected]. We are on call to respond.

[2023-07]

  1. 🧨 Feature Updates:
    • DeepSpeed ZeRo2 Integration + DDP Training
    • Support Flamingo pretraining on Laion400M/CC3M.
    • Add LoRA support for tuning LLM decoder.
    • Integration of multiple LLMs (Vicuna, MPT, LLama2, Falcon)
  2. 🤗 Checkout MIMIC-IT on Huggingface datasets.
  3. 🦦 Checkout our Otter-MPT7B Image Demo. We update the model by incoporating OpenFlamingv2 and specifically tune it to enable generation abilities for both long and short answers.
  4. 🥚 Update Eggs section for downloading MIMIC-IT dataset.
  5. 🥃 Contact us if you wish to develop Otter for your scenarios (for satellite images or funny videos?). We aim to support and assist with Otter's diverse use cases. OpenFlamingo and Otter are strong models with the Flamingo's excellently designed architecture that accepts multiple images/videos or other modality inputs. Let's build more interesting models together.

[2023-06]

  1. 🧨 Download MIMIC-IT Dataset. For more details on navigating the dataset, please refer to MIMIC-IT Dataset README.
  2. 🏎️ Run Otter Locally. You can run our model locally with at least 16G GPU mem for tasks like image/video tagging and captioning and identifying harmful content. We fix a bug related to video inference where frame tensors were mistakenly unsqueezed to a wrong vision_x.

    Make sure to adjust the sys.path.append("../..") correctly to access otter.modeling_otter in order to launch the model.

  3. 🏇 We welcome third-party evaluation on Otter and we are willing to see different VLMs chasing with each other on different arenas and benchmarks. But make sure contact us to confirm the model version and prompt strategy before publishing results. We are on call to respond.
  4. 🤗 Introducing Project Otter's brand new homepage: https://otter-ntu.github.io/. Check it out now!
  5. 🤗 Check our paper introducing MIMIC-IT in details. Meet MIMIC-IT, the first multimodal in-context instruction tuning dataset with 2.8M instructions! From general scene understanding to spotting subtle differences and enhancing egocentric view comprehension for AR headsets, our MIMIC-IT dataset has it all.

🦦 Why In-Context Instruction Tuning?

Large Language Models (LLMs) have demonstrated exceptional universal aptitude as few/zero-shot learners for numerous tasks, owing to their pre-training on extensive text data. Among these LLMs, GPT-3 stands out as a prominent model with significant capabilities. Additionally, variants of GPT-3, namely InstructGPT and ChatGPT, have proven effective in interpreting natural language instructions to perform complex real-world tasks, thanks to instruction tuning.

Motivated by the upstream interleaved format pretraining of the Flamingo model, we present 🦦 Otter, a multi-modal model based on OpenFlamingo (the open-sourced version of DeepMind's Flamingo). We train our Otter in an in-context instruction tuning way on our proposed MI-Modal In-Context Instruction Tuning (MIMIC-IT) dataset. Otter showcases improved instruction-following and in-context learning ability in both images and videos.

🗄 MIMIC-IT Dataset Details

MIMIC-IT enables the application of egocentric visual assistant model that can serve that can answer your questions like Hey, Do you think I left my keys on the table?. Harness the power of MIMIC-IT to unlock the full potential of your AI-driven visual assistant and elevate your interactive vision-language tasks to new heights.

We also introduce Syphus, an automated pipeline for generating high-quality instruction-response pairs in multiple languages. Building upon the framework proposed by LLaVA, we utilize ChatGPT to generate instruction-response pairs based on visual content. To ensure the quality of the generated instruction-response pairs, our pipeline incorporates system messages, visual annotations, and in-context examples as prompts for ChatGPT.

For more details, please check the MIMIC-IT dataset.

🤖 Otter Model Details

Otter is designed to support multi-modal in-context instruction tuning based on the OpenFlamingo model, which involves conditioning the language model on the corresponding media, such as an image that corresponds to a caption or an instruction-response pair.

We train Otter on MIMIC-IT dataset with approximately 2.8 million in-context instruction-response pairs, which are structured into a cohesive template to facilitate various tasks. Otter supports videos inputs (frames are arranged as original Flamingo's implementation) and multiple images inputs as in-context examples, which is the first multi-modal instruction tuned model.

The following template encompasses images, user instructions, and model-generated responses, utilizing the User and GPT role labels to enable seamless user-assistant interactions.

prompt = f"<image>User: {instruction} GPT:<answer> {response}<endofchunk>"

Training the Otter model on the MIMIC-IT dataset allows it to acquire different capacities, as demonstrated by the LA and SD tasks. Trained on the LA task, the model exhibits exceptional scene comprehension, reasoning abilities, and multi-round conversation capabilities.

# multi-round of conversation
prompt = f"<image>User: {first_instruction} GPT:<answer> {first_response}<endofchunk>User: {second_instruction} GPT:<answer>"

Regarding the concept of organizing visual-language in-context examples, we demonstrate here the acquired ability of the Otter model to follow inter-contextual instructions after training on the LA-T2T task. The organized input data format is as follows:

# Multiple in-context example with similar instructions
prompt = f"<image>User:{ict_first_instruction} GPT: <answer>{ict_first_response}<|endofchunk|><image>User:{ict_second_instruction} GPT: <answer>{ict_second_response}<|endofchunk|><image>User:{query_instruction} GPT: <answer>"

For more details, please refer to our paper's appendix for other tasks.

🗂️ Environments

  1. Compare cuda version returned by nvidia-smi and nvcc --version. They need to match. Or at least, the version get by nvcc --version should be <= the version get by nvidia-smi.
  2. Install the pytorch that matches your cuda version. (e.g. cuda 11.7 torch 2.0.0). We have successfully run this code on cuda 11.1 torch 1.10.1 and cuda 11.7 torch 2.0.0. You can refer to PyTorch's documentation, Latest or Previous.
  3. You may install via conda env create -f environment.yml. Especially to make sure the transformers>=4.28.0, accelerate>=0.18.0.

After configuring environment, you can use the 🦩 Flamingo model / 🦦 Otter model as a 🤗 Hugging Face model with only a few lines! One-click and then model configs/weights are downloaded automatically. Please refer to Huggingface Otter/Flamingo for details.

☄️ Training

Otter is trained based on OpenFlamingo. You may need to use converted weights at luodian/OTTER-9B-INIT or luodian/OTTER-MPT7B-Init. They are respectively converted from OpenFlamingo-LLaMA7B-v1 and OpenFlamingo-MPT7B-v2, we added a <answer> token for Otter's downstream instruction tuning.

You may also use any trained Otter weights to start with your training on top of ours, see them at Otter Weights. You can refer to MIMIC-IT for preparing image/instruction/train json files.

export PYTHONPATH=.

accelerate launch --config_file=./pipeline/accelerate_configs/accelerate_config_fsdp.yaml \
pipeline/train/instruction_following.py \
--pretrained_model_name_or_path=luodian/OTTER-LLaMA7B-INIT  \ # or --pretrained_model_name_or_path=luodian/OTTER-MPT7B-Init
--mimicit_path="path/to/DC_instruction.json" \
--images_path="path/to/DC.json" \
--train_config_path="path/to/DC_train.json" \
--batch_size=4 \
--num_epochs=9 \
--report_to_wandb \
--wandb_entity=ntu-slab \
--run_name=OTTER-LLaMA7B-densecaption \
--wandb_project=OTTER-LLaMA7B \
--workers=1 \
--lr_scheduler=cosine \
--learning_rate=1e-5 \
--warmup_steps_ratio=0.01 \

📑 Citation

If you found this repository useful, please consider citing:

@article{li2023otter,
  title={Otter: A Multi-Modal Model with In-Context Instruction Tuning},
  author={Li, Bo and Zhang, Yuanhan and Chen, Liangyu and Wang, Jinghao and Yang, Jingkang and Liu, Ziwei},
  journal={arXiv preprint arXiv:2305.03726},
  year={2023}
}

@article{li2023mimicit,
    title={MIMIC-IT: Multi-Modal In-Context Instruction Tuning},
    author={Bo Li and Yuanhan Zhang and Liangyu Chen and Jinghao Wang and Fanyi Pu and Jingkang Yang and Chunyuan Li and Ziwei Liu},
    year={2023},
    eprint={2306.05425},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}

👨‍🏫 Acknowledgements

We thank Jack Hessel for the advise and support, as well as the OpenFlamingo team for their great contribution to the open source community.

Huge accolades to Flamingo and OpenFlamingo team for the work on this great architecture.

📝 Related Projects

More Repositories

1

RelateAnything

Relate Anything Model is capable of taking an image as input and utilizing SAM to identify the corresponding mask within the image.
Python
441
star
2

Generalizable-Mixture-of-Experts

GMoE could be the next backbone model for many kinds of generalization task.
Python
290
star
3

MADAN

Pytorch Code release for our NeurIPS paper "Multi-source Domain Adaptation for Semantic Segmentation"
Python
171
star
4

Learning-Invariant-Representations-and-Risks

Pytorch code release of CVPR 21 Paper: Learning Invariant Representations and Risks
Python
32
star
5

Mapillary2COCO

Transfer Mapillary Vistas Dataset to Coco format
Python
28
star
6

GenBench

Benchmarking and Analyzing Generative Data for Visual Recognition
Python
26
star
7

Time-Series-Analysis

2017-Summer-Term-Study
23
star
8

IIB

Python
16
star
9

Data-Structure

Data structure and Algorithm
C++
8
star
10

Higher-Cloud-Computing-Project

We are devoting to building a cloud computing platform that leverages idle resources based on mobile or local networks
Java
5
star
11

Shared-Route

New way to explore your campus life.
Java
4
star
12

VisualizeUrText

Lab1-Pair_Programming
Java
3
star
13

HCCP-Patronus

This is an explosive start-up idea bounced out of my mind I was doing my course project. I am not sure when I can achieve them, but he will be sticked there to remind me his existence.
Java
3
star
14

HighPrecisionDetection

Do some experiments
Python
1
star
15

learn_to_crawl

HTML
1
star
16

Codeforces

C++
1
star
17

luodian-LAB-4

Just for SE assignment
Java
1
star
18

Network_Alignment

Task from a UCI professor
C++
1
star
19

Pytorch_Quick_Practices

Practices to quick get into pytorch
Python
1
star
20

GO_Kitti

Currently doing KITTI challenge.
Python
1
star
21

LeetCode

Record my way to improve my coding ability towards algorithms and data structures. Helping me build a solid foundation on the road of scientific research.
C++
1
star
22

Code-contest

C++
1
star
23

Unet-TGS-Salt-Challenge

TGS Salt Identification Challenge
Python
1
star
24

Analysis-WindMachine-Data

Python
1
star
25

I-Love-Study

This is an android app made by WD.Hao and L.Bo
Java
1
star
26

HCCP-Distributed-Download

第一款HCCP上的应用
Java
1
star
27

Patricia

wdh && lb
C++
1
star
28

HIT-OS

实验代码大部分借鉴前人火炬,但是ppt做的很详细,可以一看
C
1
star
29

Machine-Learning-Ng

Ng's public courses in cousera
1
star
30

What-Do-You-Like

First web app
JavaScript
1
star