• Stars
    star
    206
  • Rank 190,504 (Top 4 %)
  • Language
    Python
  • Created over 3 years ago
  • Updated about 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

[CVPR'21 Oral] Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning

Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning [CVPR'21, Oral]

By Zhicheng Huang*, Zhaoyang Zeng*, Yupan Huang*, Bei Liu, Dongmei Fu and Jianlong Fu

arxiv: https://arxiv.org/pdf/2104.03135.pdf

Introduction

This is the official implementation of the paper. In this paper, we propose SOHO to "See Out of tHe bOx" that takes a whole image as input, and learns vision-language representation in an end-to-end manner. SOHO does not require bounding box annotations which enables inference 10 times faster than region-based approaches.

Architecture

Release Progress

  • VQA Codebase

  • Pre-training Codebase

Installation

conda create -n soho python=3.7
conda activate soho
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge 
git clone https://github.com/NVIDIA/apex.git
cd apex
python setup.py install --cuda_ext --cpp_ext
cd ../ && rm -rf apex
git clone https://github.com/researchmm/soho.git
cd $SOHO_ROOT
python setup.py develop

Getting Started

  1. Download the training, validation and test data

    # download Pre-traning dataset
    mkdir -p $SOHO_ROOT/data/vg_coco_pre
    cd $SOHO_ROOT/data/vg_coco_pre
    wget http://images.cocodataset.org/zips/train2014.zip
    wget http://images.cocodataset.org/zips/val2014.zip
    #download vg dataset
    wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip
    wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip
    unzip images.zip
    unzip images2.zip
    rm -rf images.zip images2.zip
    mv VG_100K_2/*.jpg VG_100K/
    cd VG_100K
    zip -r images.zip .
    mv images.zip ../
    cd ..
    rm -rf VG_100K*
    wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/pretraining/coco_cap_train_pre.json
    wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/pretraining/coco_cap_val_pre.json
    wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/pretraining/vg_cap_pre.json
    mkdir -p $SOHO_ROOT/data/coco
    cd $SOHO_ROOT/data/coco
    # download VQA dataset
    wget http://images.cocodataset.org/zips/train2014.zip
    wget http://images.cocodataset.org/zips/val2014.zip
    wget http://images.cocodataset.org/zips/test2015.zip
    wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/vqa/train_data_vqa.json
    wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/vqa/val_data_vqa.json
    wget https://sohose.s3.ap-southeast-1.amazonaws.com/data/vqa/test_data_vqa.json
  2. Train the Pre-training models

    cd $SOHO_ROOT
    # use 8 GPUS to train the model
    bash tools/dist_train.sh configs/Pretrain/soho_res18_pre.py 8
    
    # you also can download the pre-trained models 
    mkdir -p $SOHO_ROOT/work_dirs/pretrained
    cd $SOHO_ROOT/work_dirs/pretrained
    # download pre-training weight
    wget https://sohose.s3.ap-southeast-1.amazonaws.com/checkpoint/soho_res18_fp16_40-9441cdd3.pth
  3. Training a VQA model

    cd $SOHO_ROOT
    # use 8 GPUS to train the model
    bash tools/dist_train.sh configs/VQA/soho_res18_vqa.py 8
  4. Evaluate a VQA model

    # test 18 epoch with 8GPUs
    bash tools/dist_test_vqa.sh configs/VQA/soho_res18_vqa.py 18 8

Citation

If you find this repo useful in your research, please consider citing the following papers:

@inproceedings{huang2021seeing,
  title={Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning},
  author={Huang, Zhicheng and Zeng, Zhaoyang and Huang, Yupan and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
  booktitle={The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2021}
}

@article{huang2020pixel,
  title={Pixel-bert: Aligning image pixels with text by deep multi-modal transformers},
  author={Huang, Zhicheng and Zeng, Zhaoyang and Liu, Bei and Fu, Dongmei and Fu, Jianlong},
  journal={arXiv preprint arXiv:2004.00849},
  year={2020}
}

Acknowledgements

We would like to thank mmcv and mmdetection. Our commons lib is based on mmcv.

More Repositories

1

TTSR

[CVPR'20] TTSR: Learning Texture Transformer Network for Image Super-Resolution
Python
765
star
2

SiamDW

[CVPR'19 Oral] Deeper and Wider Siamese Networks for Real-Time Visual Tracking
Python
750
star
3

Stark

[ICCV'21] Learning Spatio-Temporal Transformer for Visual Tracking
Python
645
star
4

TracKit

[ECCV'20] Ocean: Object-aware Anchor-Free Tracking
Python
612
star
5

STTN

[ECCV'2020] STTN: Learning Joint Spatial-Temporal Transformations for Video Inpainting
Jupyter Notebook
465
star
6

AOT-GAN-for-Inpainting

[TVCG'2023] AOT-GAN for High-Resolution Image Inpainting (codebase for image inpainting)
Python
424
star
7

LightTrack

[CVPR21] LightTrack: Finding Lightweight Neural Network for Object Tracking via One-Shot Architecture Search
Python
396
star
8

MM-Diffusion

[CVPR'23] MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation
Python
389
star
9

PEN-Net-for-Inpainting

[CVPR'2019] PEN-Net: Learning Pyramid-Context Encoder Network for High-Quality Image Inpainting
Python
357
star
10

img2poem

[MM'18] Beyond Narrative Description: Generating Poetry from Images by Multi-Adversarial Training
Python
280
star
11

tasn

Trilinear Attention Sampling Network for Fine-grained Image Recognition
Python
218
star
12

TTVSR

[CVPR'22 Oral] TTVSR: Learning Trajectory-Aware Transformer for Video Super-Resolution
Python
199
star
13

FTVSR

[ECCV'22] FTVSR: Learning Spatiotemporal Frequency-Transformer for Compressed Video Super-Resolution
Python
154
star
14

DBTNet

Code for our NeurIPS'19 paper "Learning Deep Bilinear Transformation for Fine-grained Image Representation"
Python
105
star
15

generate-it

A collection of models for image<->text generation in ACM MM 2021.
Python
64
star
16

CKDN

[ICCV'21] CKDN: Learning Conditional Knowledge Distillation for Degraded-Reference Image Quality Assessment
Python
55
star
17

SariGAN

[NeurIPS'20] Learning Semantic-aware Normalization for Generative Adversarial Networks
Python
53
star
18

VOT2019

The Winner and Runner-up Trackers for VOT-2019 Challenges
Python
51
star
19

WSOD2

[ICCV'19] WSOD^2: Learning Bottom-up and Top-down Objectness Distillation for Weakly-supervised Object Detection
Python
47
star
20

VQD-SR

[ICCV'23] VQD-SR: Learning Data-Driven Vector-Quantized Degradation Model for Animation Video Super-Resolution
Python
37
star
21

CyDAS

Cyclic Differentiable Architecture Search
Python
34
star
22

NEAS

Python
19
star
23

2D-TAN

AAAI2020 - Learning 2D Temporal Localization Networks for Moment Localization with Natural Language
Python
17
star
24

STTR

[ACCV'22] Fine-Grained Image Style Transfer with Visual Transformers
Python
14
star
25

AAST-pytorch

[MM'20] Aesthetic-Aware Image Style Transfer
Python
14
star
26

davinci-videofactory

JavaScript
12
star
27

AI_Illustrator

[MM'22 Oral] AI Illustrator: Translating Raw Descriptions into Images by Prompt-based Cross-Modal Generation
Python
11
star
28

language-guided-animation

[TMM 2023] Language-Guided Face Animation by Recurrent StyleGAN-based Generator
Python
11
star