  • Language
    Python
  • License
    MIT License
  • Created almost 4 years ago
  • Updated over 1 year ago

Repository Details

PyTorch code for "Unifying Vision-and-Language Tasks via Text Generation" (ICML 2021)

Unifying Vision-and-Language Tasks via Text Generation

Setup

# Create python environment (optional)
conda create -n vlt5 python=3.7
source activate vlt5

# Install python dependencies
pip install -r requirements.txt

# Download T5/BART backbone checkpoint
python download_backbones.py

# For MSCOCO captioning evaluation (optional; for captioning only)
python -c "import language_evaluation; language_evaluation.download('coco')"

Code structure

# Store images, features, and annotations
./datasets
    COCO/
        images/
        features/
    VG/
        images/
        features/
    GQA/
        images/
        features/
    nlvr/
        images/
        features/
    RefCOCO/

    ...

# Run feature extraction
./feature_extraction

# Train VL-T5
./VL-T5/
    src/
        modeling_t5.py, modeling_bart.py                      <= VL-T5/VL-BART model classes
        pretrain.py, pretrain_data.py, pretrain_model.py      <= pretraining
        vqa.py, vqa_data.py, vqa_model.py ...                 <= fine-tuning on downstream tasks (e.g., VQA, GQA, NLVR2)
        multitask.py, multitask_data.py, multitask_model.py   <= multitask learning on 7 downstream tasks
        param.py                                              <= (argparse) configuration
        tokenization.py                                       <= custom tokenizer
        utils.py, dist_utils.py                               <= utility functions
    snap/                                                     <= store weight checkpoints
    scripts/                                                  <= bash scripts for pretraining and finetuning

API

import sys
sys.path.append('./VL-T5/src')

# Parse configuration
from param import parse_args
args = parse_args(
    backbone='t5-base',  # Backbone architecture
    load='./snap/pretrain/VLT5/Epoch30',  # Pretrained checkpoint
    parse=False,  # False for interactive env (e.g., Jupyter)
)
# Assign GPU
args.gpu = 0

# Load data loaders
from vqa_data import get_loader
train_loader = get_loader(
    args,
    split=args.train,
    ...
)
val_loader = get_loader(
    args,
    split=args.valid,
    ...
)
test_loader = get_loader(
    args,
    split=args.test,
    ...
)

# Import trainer
from vqa import Trainer
trainer = Trainer(
    args,
    train_loader=train_loader,
    val_loader=val_loader,
    test_loader=test_loader,
)

# model is attached to trainer
model = trainer.model

# Each task-specific model class inherits from the VLT5/VLBart classes, which in turn inherit from the Hugging Face Transformers T5/BART classes
print(model)
>>> VLT5VQA(
    (shared): Embedding(...)
    (encoder): JointEncoder(...)
    ...
)

# Training
train_batch = next(iter(train_loader))
model.train_step(train_batch)
>>> {'loss': ... }

# Inference
test_batch = next(iter(test_loader))
model.test_step(test_batch)
>>> {'pred_ans': ... }

To add a new task, start by writing three files, adapting each from an existing one.

NEW_TASK_model.py # Define a VLT5NewTask/VLBartNewTask model which inherits VLT5/VLBart class
NEW_TASK_data.py # Define Dataset/DataLoader/Evaluator
NEW_TASK.py # Define a trainer which inherits TrainerBase (trainer_base.py)
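
The three-file pattern above can be sketched roughly as follows. This is only an illustration of the inheritance layout, not the repository's code: `VLT5` and `TrainerBase` are empty stand-ins for the real classes in `modeling_t5.py` and `trainer_base.py`, and `VLT5NewTask`, `NewTaskDataset`, the method bodies, and the `sentences` batch key are hypothetical.

```python
# Stand-ins so the sketch runs on its own. In the repository, import the real
# classes instead (VLT5 from modeling_t5.py, TrainerBase from trainer_base.py).
class VLT5:
    pass


class TrainerBase:
    def __init__(self, args, train_loader=None, val_loader=None, test_loader=None):
        self.args = args
        self.train_loader = train_loader
        self.val_loader = val_loader
        self.test_loader = test_loader


# NEW_TASK_model.py: task-specific model exposing train_step / test_step
class VLT5NewTask(VLT5):
    def train_step(self, batch):
        # Hypothetical: a real model would run a forward pass and return its loss.
        return {"loss": 0.0}

    def test_step(self, batch):
        # Hypothetical: a real model would generate text and decode it.
        return {"pred_ans": ["" for _ in batch["sentences"]]}


# NEW_TASK_data.py: dataset that yields one example per annotation
class NewTaskDataset:
    def __init__(self, annotations):
        self.annotations = annotations

    def __len__(self):
        return len(self.annotations)

    def __getitem__(self, idx):
        return self.annotations[idx]


# NEW_TASK.py: trainer that owns the model and the loaders
class Trainer(TrainerBase):
    def __init__(self, args, train_loader=None, val_loader=None, test_loader=None):
        super().__init__(args, train_loader, val_loader, test_loader)
        self.model = VLT5NewTask()
```

Keeping the `train_step`/`test_step` interface the same as in the API example lets a new trainer reuse the existing training and evaluation loops with minimal changes.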

Download Pre-trained models / Pre-extracted features

We host model checkpoints and features on Google Drive. We recommend using gdrive to download them.

Pretrained Models

gdrive download 1_SBj4sZ0gUqfBon1gFBiNRAmfHv5w_ph --recursive

COCO+VG pretraining (default)

  • VL-T5/snap/pretrain/VLT5/Epoch30.pth: VL-T5 pretrained for 30 epochs on COCO+VG
  • VL-T5/snap/pretrain/VLBart/Epoch30.pth: VL-BART pretrained for 30 epochs on COCO+VG

VCR pretraining (2nd stage)

  • VL-T5/snap/vcr_pretrain/VLT5/Epoch20.pth: VL-T5 further pretrained for 20 epochs on VCR
  • VL-T5/snap/vcr_pretrain/VLBart/Epoch20.pth: VL-BART further pretrained for 20 epochs on VCR
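
The API example passes `load='./snap/pretrain/VLT5/Epoch30'` while the downloaded files are named `Epoch30.pth`, which suggests that `load` is given without the extension. The sketch below assumes that convention (it is not confirmed here) and checks which of the listed checkpoints are present after the download. `checkpoint_file` and `missing_checkpoints` are hypothetical helper names, not part of the repository.

```python
from pathlib import Path

# Checkpoint files listed above, relative to the repository root.
EXPECTED_CHECKPOINTS = [
    "VL-T5/snap/pretrain/VLT5/Epoch30.pth",
    "VL-T5/snap/pretrain/VLBart/Epoch30.pth",
    "VL-T5/snap/vcr_pretrain/VLT5/Epoch20.pth",
    "VL-T5/snap/vcr_pretrain/VLBart/Epoch20.pth",
]


def checkpoint_file(load: str) -> Path:
    """Map a `load` value such as './snap/pretrain/VLT5/Epoch30' to its .pth file.

    Assumption: the '.pth' extension is appended to the `load` value.
    """
    return Path(load + ".pth")


def missing_checkpoints(repo_root: str = ".") -> list:
    """Return the expected checkpoint paths that are not present under repo_root."""
    root = Path(repo_root)
    return [p for p in EXPECTED_CHECKPOINTS if not (root / p).is_file()]
```

Calling `missing_checkpoints()` from the repository root returns an empty list once all four files are in place.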

Dataset Preparation / Feature extraction

gdrive download 1MBBhlkP83VMKS2Qe0SmFfzkHhMpIG5wf --recursive
  • Multi30K only
    • git clone --recursive https://github.com/multi30k/dataset ./datasets/multi30k-dataset
    • decompress train.en.gz, val.en.gz, test_2017_flickr.en.gz, test_2018_flickr.en.gz in ./datasets/multi30k-dataset/data/task1/raw/ (e.g., with gunzip)
    • decompress train.de.gz, val.de.gz, test_2017_flickr.de.gz, test_2018_flickr.de.gz in ./datasets/multi30k-dataset/data/task1/raw/ (e.g., with gunzip)
  • For manual feature extraction, please check out ./feature_extraction
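
The Multi30K decompression step can also be done from Python with the standard library. The following is a minimal sketch; `decompress_gz_files` is a hypothetical helper, and it decompresses every `*.gz` file in the directory (not only the eight files named above), leaving the original archives in place.

```python
import gzip
import shutil
from pathlib import Path


def decompress_gz_files(raw_dir: str) -> list:
    """Decompress every *.gz file in raw_dir next to the original; return the new paths."""
    written = []
    for gz_path in sorted(Path(raw_dir).glob("*.gz")):
        out_path = gz_path.with_suffix("")  # train.en.gz -> train.en
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```

For the layout described above, the call would be `decompress_gz_files("./datasets/multi30k-dataset/data/task1/raw/")`.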

Pretraining on COCO+VG

# Pretraining with 4 gpus
cd VL-T5/
bash scripts/COCOVG_pretrain_VLT5.sh 4
bash scripts/COCOVG_pretrain_VLBart.sh 4

Downstream tasks

VQA

# Finetuning with 4 gpus
cd VL-T5/
bash scripts/VQA_VLT5.sh 4
bash scripts/VQA_VLBart.sh 4

GQA

# Finetuning with 4 gpus
cd VL-T5/
bash scripts/GQA_VLT5.sh 4
bash scripts/GQA_VLBart.sh 4

NLVR2

# Finetuning with 4 gpus
cd VL-T5/
bash scripts/NLVR_VLT5.sh 4
bash scripts/NLVR_VLBart.sh 4

RefCOCOg

# Finetuning with 4 gpus
cd VL-T5/
bash scripts/RefCOCOg_VLT5.sh 4
bash scripts/RefCOCOg_VLBart.sh 4

VCR

# Pretraining on VCR with 4 gpus (optional)
cd VL-T5/
bash scripts/VCR_pretrain_VLT5.sh 4
bash scripts/VCR_pretrain_VLBart.sh 4

# Finetuning with 4 gpus
cd VL-T5/
bash scripts/VCR_VLT5.sh 4
bash scripts/VCR_VLBart.sh 4

COCO Caption

# Finetuning with 4 gpus
cd VL-T5/
bash scripts/COCOCaption_VLT5.sh 4
bash scripts/COCOCaption_VLBart.sh 4

Multi30K

# Finetuning with 4 gpus
cd VL-T5/
bash scripts/Multi30K_VLT5.sh 4
bash scripts/Multi30K_VLBart.sh 4
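
The fine-tuning commands for the downstream tasks all share one shape: `bash scripts/<Task>_<Backbone>.sh <number of GPUs>`, run from `VL-T5/`. To launch several of them from Python, a small wrapper could look like the sketch below. It assumes every script follows that naming exactly (including the `RefCOCOg` capitalization for both backbones); `finetune_command` and `run_finetune` are hypothetical helpers, not part of the repository.

```python
import subprocess

# Script name prefixes used in the fine-tuning sections above.
TASKS = ["VQA", "GQA", "NLVR", "RefCOCOg", "VCR", "COCOCaption", "Multi30K"]
BACKBONES = ["VLT5", "VLBart"]


def finetune_command(task: str, backbone: str, n_gpus: int = 4) -> list:
    """Build the `bash scripts/<task>_<backbone>.sh <n_gpus>` command as an argv list."""
    if task not in TASKS:
        raise ValueError(f"unknown task: {task}")
    if backbone not in BACKBONES:
        raise ValueError(f"unknown backbone: {backbone}")
    return ["bash", f"scripts/{task}_{backbone}.sh", str(n_gpus)]


def run_finetune(task: str, backbone: str, n_gpus: int = 4, cwd: str = "VL-T5/") -> None:
    """Run one fine-tuning script from the VL-T5/ directory; raises if it exits non-zero."""
    subprocess.run(finetune_command(task, backbone, n_gpus), cwd=cwd, check=True)
```

Building the argument list separately from running it makes it easy to print and inspect the commands before starting a long job.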

Reference

Please cite our paper if you use our models in your work:

@inproceedings{cho2021vlt5,
  title     = {Unifying Vision-and-Language Tasks via Text Generation},
  author    = {Jaemin Cho and Jie Lei and Hao Tan and Mohit Bansal},
  booktitle = {ICML},
  year      = {2021}
}
