VILLA: Vision-and-Language Adversarial Training
This is the official repository of VILLA (NeurIPS 2020 Spotlight). This repository currently supports adversarial finetuning of UNITER on VQA, VCR, NLVR2, and SNLI-VE. Adversarial pre-training with in-domain data will be available soon. Both VILLA-base and VILLA-large pre-trained checkpoints are released.
Most of the code in this repo is copied or modified from UNITER.
Requirements
We provide a Docker image for easier reproduction. Please install the following:
- nvidia driver (418+),
- Docker (19.03+),
- nvidia-container-toolkit.
Our scripts require docker group membership so that docker commands can be run without sudo. We only support Linux with NVIDIA GPUs; we have tested on Ubuntu 18.04 with V100 cards. We use mixed-precision training, so GPUs with Tensor Cores are recommended.
Quick Start
NOTE: Please run bash scripts/download_pretrained.sh $PATH_TO_STORAGE to get our latest pretrained VILLA checkpoints. This will download both the base and large models.
We use VQA as an end-to-end example of how to use this code base.
- Download processed data and pretrained models with the following command.
bash scripts/download_vqa.sh $PATH_TO_STORAGE
After downloading you should see the following folder structure:
├── finetune
├── img_db
│   ├── coco_test2015
│   ├── coco_test2015.tar
│   ├── coco_train2014
│   ├── coco_train2014.tar
│   ├── coco_val2014
│   ├── coco_val2014.tar
│   ├── vg
│   └── vg.tar
├── pretrained
│   ├── uniter-base.pt
│   └── villa-base.pt
└── txt_db
    ├── vqa_devval.db
    ├── vqa_devval.db.tar
    ├── vqa_test.db
    ├── vqa_test.db.tar
    ├── vqa_train.db
    ├── vqa_train.db.tar
    ├── vqa_trainval.db
    ├── vqa_trainval.db.tar
    ├── vqa_vg.db
    └── vqa_vg.db.tar
You can put different pre-trained checkpoints inside the pretrained folder depending on your needs, as sketched below.
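A minimal sketch of keeping several checkpoints side by side (villa-large.pt is an assumed filename; check what download_pretrained.sh actually produced). The checkpoint used for finetuning is then selected with the --checkpoint option described under Customization below.
# list the checkpoints currently available
ls $PATH_TO_STORAGE/pretrained
# uniter-base.pt  villa-base.pt  villa-large.pt
# copy in any additional checkpoint you want to finetune from
cp /path/to/your/other_checkpoint.pt $PATH_TO_STORAGE/pretrained/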
- Launch the Docker container for running the experiments.
# docker image should be automatically pulled
source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
    $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained
The launch script respects the $CUDA_VISIBLE_DEVICES environment variable (see the example below). Note that the source code is mounted into the container under /src instead of being built into the image, so user modifications are reflected without rebuilding the image. (Data folders are mounted into the container separately for flexibility on folder structures.)
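For example, to expose only two of the host GPUs to the container (a minimal sketch; the GPU indices are arbitrary):
# restrict the container to GPUs 0 and 1, then launch as usual
export CUDA_VISIBLE_DEVICES=0,1
source launch_container.sh $PATH_TO_STORAGE/txt_db $PATH_TO_STORAGE/img_db \
    $PATH_TO_STORAGE/finetune $PATH_TO_STORAGE/pretrained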
- Run finetuning for the VQA task.
# inside the container
horovodrun -np $N_GPU python train_vqa_adv.py --config $YOUR_CONFIG_JSON
# specific example
horovodrun -np 4 python train_vqa_adv.py --config config/train-vqa-base-4gpu-adv.json
- Run inference for the VQA task and then evaluate.
# inference
python inf_vqa.py --txt_db /txt/vqa_test.db --img_db /img/coco_test2015 \
    --output_dir $VQA_EXP --checkpoint 6000 --pin_mem --fp16
The result file will be written to $VQA_EXP/results_test/results_6000_all.json, which can be submitted to the evaluation server.
- Customization
# training options
python train_vqa_adv.py --help
- command-line arguments overwrite JSON config files
- JSON config overwrites argparse default values
- use horovodrun to run multi-GPU training
- --gradient_accumulation_steps emulates multi-GPU training
- --checkpoint selects the UNITER or VILLA pre-trained checkpoint to finetune from
- --adv_training decides whether to use adversarial training
- --adv_modality takes values from ['text'], ['image'], ['text','image'], and ['text','image','alter']; the last two add perturbations to both modalities simultaneously or alternately
See the sketch after this list for overriding these options from the command line.
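For example, command-line flags can override what is in the JSON config. The sketch below is illustrative only: the exact argument syntax (whether --adv_training is a boolean switch and whether --adv_modality takes space-separated values) is an assumption, so check python train_vqa_adv.py --help first; /pretrain/villa-base.pt assumes the pretrained folder is mounted at /pretrain inside the container.
# inside the container: finetune from the VILLA-base checkpoint, perturbing text and
# image alternately, and emulate a larger effective batch size with gradient accumulation
horovodrun -np 4 python train_vqa_adv.py \
    --config config/train-vqa-base-4gpu-adv.json \
    --checkpoint /pretrain/villa-base.pt \
    --adv_modality text image alter \
    --gradient_accumulation_steps 2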
Downstream Tasks Finetuning
VCR
NOTE: train and inference should be run inside the docker container
- download data
bash scripts/download_vcr.sh $PATH_TO_STORAGE
- train
horovodrun -np 4 python train_vcr_adv.py --config config/train-vcr-base-4gpu-adv.json \
    --output_dir $VCR_EXP
- inference
horovodrun -np 4 python inf_vcr.py --txt_db /txt/vcr_test.db \
    --img_db "/img/vcr_gt_test/;/img/vcr_test/" \
    --split test --output_dir $VCR_EXP --checkpoint 8000 \
    --pin_mem --fp16
The result file will be written to $VCR_EXP/results_test/results_8000_all.csv, which can be submitted to the VCR leaderboard for evaluation.
NLVR2
NOTE: train and inference should be run inside the docker container
- download data
bash scripts/download_nlvr2.sh $PATH_TO_STORAGE
- train
horovodrun -np 4 python train_nlvr2_adv.py --config config/train-nlvr2-base-1gpu-adv.json \
    --output_dir $NLVR2_EXP
- inference
python inf_nlvr2.py --txt_db /txt/nlvr2_test1.db/ --img_db /img/nlvr2_test/ \
    --train_dir /storage/nlvr-base/ --ckpt 6500 --output_dir . --fp16
Visual Entailment (SNLI-VE)
NOTE: train should be run inside the docker container
- download data
bash scripts/download_ve.sh $PATH_TO_STORAGE
- train
horovodrun -np 2 python train_ve_adv.py --config config/train-ve-base-2gpu-adv.json \
    --output_dir $VE_EXP
Adversarial Training of LXMERT
To keep things simple, we provide a separate repo that can be used to reproduce our results on adversarial finetuning of LXMERT on VQA, GQA, and NLVR2.
Citation
If you find this code useful for your research, please consider citing:
@inproceedings{gan2020large,
title={Large-Scale Adversarial Training for Vision-and-Language Representation Learning},
author={Gan, Zhe and Chen, Yen-Chun and Li, Linjie and Zhu, Chen and Cheng, Yu and Liu, Jingjing},
booktitle={NeurIPS},
year={2020}
}
@inproceedings{chen2020uniter,
title={Uniter: Universal image-text representation learning},
author={Chen, Yen-Chun and Li, Linjie and Yu, Licheng and Kholy, Ahmed El and Ahmed, Faisal and Gan, Zhe and Cheng, Yu and Liu, Jingjing},
booktitle={ECCV},
year={2020}
}
License
MIT