Paper | Demo | Checkpoints | Datasets | ModelScope
ONE-PEACE is a general representation model across vision, audio, and language modalities. Without using any vision or language pretrained model for initialization, ONE-PEACE achieves leading results in vision, audio, audio-language, and vision-language tasks. Furthermore, ONE-PEACE possesses a strong emergent zero-shot retrieval capability, enabling it to align modalities that are not paired in the training data.
The figure below shows the architecture and pretraining tasks of ONE-PEACE. With its scaling-friendly architecture and modality-agnostic tasks, ONE-PEACE has the potential to expand to unlimited modalities.
Online Demo
We provide an online demo on Hugging Face Spaces. In this demo, you can combine multiple modalities to retrieve related images, such as audio-to-image, audio+text-to-image, audio+image-to-image, and even audio+image+text-to-image.
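One plausible way to build such combined queries is sketched below. It is an assumption for illustration, not the demo's actual implementation: each query embedding is L2-normalized, the embeddings are summed and renormalized, and gallery images are ranked by cosine similarity. The helper names (`normalize`, `combine_queries`, `rank_gallery`) and the toy 3-dimensional vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def normalize(vec: Sequence[float]) -> Vector:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def combine_queries(embeddings: Sequence[Sequence[float]]) -> Vector:
    """ASSUMPTION: fuse modalities by summing unit vectors, then renormalizing."""
    unit = [normalize(e) for e in embeddings]
    summed = [sum(dims) for dims in zip(*unit)]
    return normalize(summed)


def rank_gallery(query: Sequence[float], gallery: Sequence[Sequence[float]]) -> List[int]:
    """Return gallery indices sorted by descending cosine similarity to the query."""
    scores = [sum(q * g for q, g in zip(query, normalize(item))) for item in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy stand-ins for an audio embedding, a text embedding, and three image embeddings.
audio_query = [1.0, 0.0, 0.0]
text_query = [0.0, 1.0, 0.0]
gallery = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]

query = combine_queries([audio_query, text_query])
ranking = rank_gallery(query, gallery)
print(ranking)  # the image that matches both cues (index 1) ranks first
```

In practice the toy vectors would be replaced by features extracted with the model, which the API section describes as already normalized.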
News
- 2023.6.23: Released vision tasks fine-tuning scripts and checkpoints. See guidance for vision tasks for more details.
- 2023.6.04: Released the pretraining scripts. See guidance for pretraining for more details.
- 2023.5.30: Released the finetuned checkpoints and scripts for audio(-language) tasks.
- 2023.5.29: Released the finetuned checkpoints for vision-language tasks.
- 2023.5.27: 🔥 We have provided the multimodal retrieval demo on Hugging Face Spaces. Have fun!
- 2023.5.25: Released the easy-to-use API, which enables quick extraction of image, audio, and text representations.
- 2023.5.23: Released the pretrained checkpoint, as well as finetuning & inference scripts for vision-language tasks.
- 2023.5.19: Released the paper and code. Pretrained & finetuned checkpoints, training & inference scripts, as well as demos will be released as soon as possible.
Models and Results
Model Card
We list the parameters and pretrained checkpoints of ONE-PEACE below. Note that ONE-PEACE can be disassembled into different branches to handle different tasks. We also provide the vision-branch of ONE-PEACE, which can be used to perform vision tasks.
Model | Ckpt | Params | Hidden size | Intermediate size | Attention heads | Layers |
---|---|---|---|---|---|---|
ONE-PEACE | Download | 4B | 1536 | 6144 | 24 | 40 |
ONE-PEACE (Vision Branch) | Download | 1.5B | 1536 | 6144 | 24 | 40 |
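This README does not describe the internal layer layout, so the following is only a back-of-the-envelope estimate that relates the table's hidden size, intermediate size, and layer count to the listed parameter counts. It rests on two assumptions that are not stated here: (1) the self-attention weights are shared across the three modalities while each modality has its own feed-forward network, and (2) each feed-forward network is gated and therefore holds three weight matrices. Biases, embeddings, adapters, and normalization parameters are ignored.

```python
HIDDEN = 1536        # "Hidden size" column
INTERMEDIATE = 6144  # "Intermediate size" column
LAYERS = 40          # "Layers" column
MODALITIES = 3       # vision, audio, language


def attention_params(hidden: int) -> int:
    """Query, key, value, and output projections (biases ignored)."""
    return 4 * hidden * hidden


def gated_ffn_params(hidden: int, intermediate: int) -> int:
    """ASSUMPTION: a gated FFN with three hidden x intermediate matrices."""
    return 3 * hidden * intermediate


# ASSUMPTION: attention is shared by all modalities; FFNs are per modality.
shared_attention = LAYERS * attention_params(HIDDEN)
one_ffn_stack = LAYERS * gated_ffn_params(HIDDEN, INTERMEDIATE)

vision_branch = shared_attention + one_ffn_stack
full_model = shared_attention + MODALITIES * one_ffn_stack

print(f"single branch: {vision_branch / 1e9:.2f}B")  # 1.51B
print(f"full model:    {full_model / 1e9:.2f}B")     # 3.77B
```

Under these assumptions the estimate lands near the 1.5B and 4B figures in the table, which is consistent with a single branch reusing a subset of the full model's weights.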
Results
Vision Tasks
Task | Image Classification | Semantic Segmentation | Object Detection (w/o Object365) | Video Action Recognition |
---|---|---|---|---|
Dataset | ImageNet-1K | ADE20K | COCO | Kinetics 400 |
Split | val | val | val | val |
Metric | Acc. | mIoU (ss) / mIoU (ms) | AP (box) / AP (mask) | Top-1 Acc. / Top-5 Acc. |
ONE-PEACE | 89.8 | 62.0 / 63.0 | 60.4 / 52.9 | 88.1 / 97.8 |
Audio(-language) Tasks
Task | Audio-Text Retrieval | Audio-Text Retrieval | Audio-Text Retrieval | Audio-Text Retrieval | Audio Classification | Audio Classification | Audio Classification | Audio Question Answering |
---|---|---|---|---|---|---|---|---|
Dataset | AudioCaps | AudioCaps | Clotho | Clotho | ESC-50 | FSD50K | VGGSound (Audio Only) | AVQA (Audio + Question) |
Split | test | test | evaluation | evaluation | full | eval | test | val |
Metric | T2A R@1 | A2T R@1 | T2A R@1 | A2T R@1 | Zero-shot Acc. | MAP | Acc. | Acc. |
ONE-PEACE | 42.5 | 51.0 | 22.4 | 27.1 | 91.8 | 69.7 | 59.6 | 86.2 |
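The retrieval columns report R@1, the fraction of queries whose correct item is ranked first. A minimal sketch of that metric follows, under the simplifying assumption that query `i` has exactly one correct item at index `i` (benchmark datasets may pair several captions with one clip, which this sketch does not handle). The function name `recall_at_k` and the similarity values are made up for illustration.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int = 1) -> float:
    """Fraction of queries (rows) whose matching item (same index) is in the top k."""
    hits = 0
    for query_index, row in enumerate(similarity):
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        if query_index in top_k:
            hits += 1
    return hits / len(similarity)


# Rows are text queries, columns are audio clips (toy values, not model outputs).
text_to_audio = [
    [0.9, 0.1, 0.2],  # correct clip 0 ranked first
    [0.3, 0.5, 0.8],  # correct clip 1 ranked second
    [0.1, 0.4, 0.7],  # correct clip 2 ranked first
]

# Transposing the matrix gives the audio-to-text direction.
audio_to_text = [list(column) for column in zip(*text_to_audio)]

print(recall_at_k(text_to_audio, k=1))
print(recall_at_k(text_to_audio, k=2))
print(recall_at_k(audio_to_text, k=1))
```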
Vision-Language Tasks
Task | Image-Text Retrieval (w/o ranking) | Image-Text Retrieval (w/o ranking) | Image-Text Retrieval (w/o ranking) | Image-Text Retrieval (w/o ranking) | Visual Grounding | Visual Grounding | Visual Grounding | VQA | Visual Reasoning |
---|---|---|---|---|---|---|---|---|---|
Dataset | COCO | COCO | Flickr30K | Flickr30K | RefCOCO | RefCOCO+ | RefCOCOg | VQAv2 | NLVR2 |
Split | test | test | test | test | val / testA / testB | val / testA / testB | val-u / test-u | test-dev / test-std | dev / test-P |
Metric | I2T R@1 | T2I R@1 | I2T R@1 | T2I R@1 | Acc@0.5 | Acc@0.5 | Acc@0.5 | Acc. | Acc. |
ONE-PEACE | 84.1 | 65.4 | 97.6 | 89.6 | 92.58 / 94.18 / 89.26 | 88.77 / 92.21 / 83.23 | 89.22 / 89.27 | 82.6 / 82.5 | 87.8 / 88.3 |
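For the visual grounding columns, a prediction is conventionally counted as correct when the predicted box overlaps the ground-truth box with an intersection-over-union (IoU) of at least 0.5. The sketch below illustrates that convention. The function names, the `(x1, y1, x2, y2)` box layout, and the choice of `>=` rather than `>` at the threshold are assumptions, not details taken from the ONE-PEACE evaluation scripts.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    intersection = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def grounding_accuracy(
    predictions: Sequence[Box], targets: Sequence[Box], threshold: float = 0.5
) -> float:
    """Share of predictions whose IoU with the target reaches the threshold."""
    correct = sum(box_iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return correct / len(targets)


# Toy boxes: IoU is 1.0, 1/3, and 0.8 respectively, so two of three count as hits.
predictions = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (5, 0, 15, 10), (2, 0, 10, 10)]

print(grounding_accuracy(predictions, targets))
```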
Requirements and Installation
- Python >= 3.7
- PyTorch >= 1.10.0 (1.13.1 recommended)
- CUDA Version >= 10.2 (11.6 recommended)
- Install required packages:
git clone https://github.com/OFA-Sys/ONE-PEACE
cd ONE-PEACE
pip install -r requirements.txt
- For faster training, install the Apex library (optional):
git clone https://github.com/NVIDIA/apex
cd apex && pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--distributed_adam" --global-option="--deprecated_fused_adam" ./
- Install the xFormers library to use memory-efficient attention (optional):
conda install xformers -c xformers
- Install the FlashAttention library to use faster LayerNorm (optional):
git clone --recursive https://github.com/HazyResearch/flash-attention
cd flash-attention && pip install .
cd csrc/layer_norm && pip install .
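Before installing the optional libraries, it can help to confirm that the environment meets the minimum versions listed above. The script below is a small hypothetical helper, not part of the repository: `parse_version`, `meets_minimum`, and `check_environment` are names introduced here. Note that `torch.version.cuda` reports the CUDA version PyTorch was built against, which may differ from the system toolkit.

```python
import re
import sys
from typing import Dict, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn a version string such as '1.13.1+cu116' into a tuple like (1, 13, 1)."""
    numbers = []
    for piece in text.split("+")[0].split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    return tuple(numbers)


def meets_minimum(version_text: str, minimum: Tuple[int, ...]) -> bool:
    """Compare a version string against a minimum version tuple."""
    return parse_version(version_text) >= minimum


def check_environment() -> Dict[str, bool]:
    """Check Python, PyTorch, and the CUDA build against the README minimums."""
    report = {"python": sys.version_info[:2] >= (3, 7)}
    try:
        import torch
    except ImportError:
        report["torch"] = False
        report["cuda"] = False
        return report
    report["torch"] = meets_minimum(str(torch.__version__), (1, 10, 0))
    cuda_version = torch.version.cuda  # None for CPU-only builds
    report["cuda"] = cuda_version is not None and meets_minimum(cuda_version, (10, 2))
    return report


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing or too old'}")
```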
Datasets and Checkpoints
See datasets.md and checkpoints.md.
Usage
API
We provide a simple code snippet to show how to use the API for ONE-PEACE. We use ONE-PEACE to compute embeddings for text, images, and audio, as well as their similarities:
import torch
from one_peace.models import from_pretrained
device = "cuda" if torch.cuda.is_available() else "cpu"
# "ONE-PEACE" can also be replaced with ckpt path
model = from_pretrained("ONE-PEACE", device=device, dtype="float32")
# process raw data
src_tokens = model.process_text(["cow", "dog", "elephant"])
src_images = model.process_image(["demo_assets/dog.JPEG", "demo_assets/elephant.JPEG"])
src_audios, audio_padding_masks = model.process_audio(["demo_assets/cow.flac", "demo_assets/dog.flac"])
with torch.no_grad():
    # extract normalized features
    text_features = model.extract_text_features(src_tokens)
    image_features = model.extract_image_features(src_images)
    audio_features = model.extract_audio_features(src_audios, audio_padding_masks)
    # compute similarity
    i2t_similarity = image_features @ text_features.T
    a2t_similarity = audio_features @ text_features.T
print("Image-to-text similarities:", i2t_similarity)
print("Audio-to-text similarities:", a2t_similarity)
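Each similarity matrix has one row per image (or audio clip) and one column per text, so the best-matching text for a row is the column with the largest value. The sketch below shows one way to read such a matrix using plain Python lists, for example after calling `.tolist()` on the tensors. The numbers are placeholders rather than real model outputs, and the `scale` parameter of the softmax is a hypothetical knob, not a documented ONE-PEACE setting.

```python
import math
from typing import List, Sequence


def softmax(row: Sequence[float], scale: float = 1.0) -> List[float]:
    """Numerically stable softmax; `scale` sharpens the distribution when > 1."""
    peak = max(row)
    exps = [math.exp(scale * (x - peak)) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def best_match(similarity: Sequence[Sequence[float]], labels: Sequence[str]) -> List[str]:
    """Pick, for every row, the label whose column has the highest similarity."""
    return [labels[max(range(len(row)), key=row.__getitem__)] for row in similarity]


labels = ["cow", "dog", "elephant"]

# Placeholder values standing in for i2t_similarity.tolist() (dog image, elephant image).
i2t = [
    [0.10, 0.31, 0.12],
    [0.08, 0.11, 0.29],
]

print(best_match(i2t, labels))
for row in i2t:
    print([round(p, 3) for p in softmax(row, scale=10.0)])
```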
Training & Inference
Beyond the API, we provide comprehensive training and inference instructions for audio & multimodal tasks and for vision tasks.
Gallery
Visual Grounding (unseen domain)
Emergent Zero-shot Retrieval
Acknowledgement
- Fairseq: A sequence modeling toolkit with flexible configuration and highly extensible code structure.
- xFormers: A toolbox to accelerate research on Transformers.
- FlashAttention: A repository that provides the official implementation of FlashAttention, which greatly speeds up multi-head attention.
- Apex: A repository that provides useful model acceleration and memory optimization techniques.
Getting Involved
Feel free to submit GitHub issues or pull requests. Contributions to our project are welcome!
To contact us, do not hesitate to send an email to [email protected] or [email protected]!
Citation
If you find our paper and code useful in your research, please consider giving the repository a star and citing the paper:
@article{wang2023one,
title={ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities},
author={Wang, Peng and Wang, Shijie and Lin, Junyang and Bai, Shuai and Zhou, Xiaohuan and Zhou, Jingren and Wang, Xinggang and Zhou, Chang},
journal={arXiv preprint arXiv:2305.11172},
year={2023}
}