VOLTA: Visiolinguistic Transformer Architectures
This is the implementation of the framework described in the paper:
Emanuele Bugliarello, Ryan Cotterell, Naoaki Okazaki and Desmond Elliott. Multimodal Pretraining Unmasked: A Meta-Analysis and a Unified Framework of Vision-and-Language BERTs. Transactions of the Association for Computational Linguistics 2021; 9 978–994.
We provide the code for reproducing our results, preprocessed data and pretrained models.
News
- 02-2022: Added code for IGLUE (Bugliarello et al., 2022) [Original code]
- 02-2022: Added code for MaRVL (Liu and Bugliarello et al., EMNLP 2021) [Original code]
- 09-2021: Added code for cross-modal ablation (Frank and Bugliarello et al., EMNLP 2021) [Original code]
Repository Setup
You can clone this repository, including its submodules, by running:
git clone git@github.com:e-bug/volta
1. Create a fresh conda environment, and install all dependencies.
conda create -n volta python=3.6
conda activate volta
pip install -r requirements.txt
2. Install PyTorch
conda install pytorch=1.4.0 torchvision=0.5 cudatoolkit=10.1 -c pytorch
3. Install apex. If you use a cluster, you may want to first run commands like the following:
module load cuda/10.1.105
module load gcc/8.3.0-cuda
4. Set up the refer submodule for Referring Expression Comprehension:
cd tools/refer; make
5. Install this codebase as a package in this environment.
python setup.py develop
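As a quick sanity check after the steps above, the following sketch compares the active environment against the versions pinned in this section (Python 3.6, PyTorch 1.4.0, torchvision 0.5). It is not part of the repository: the names EXPECTED, version_matches, installed_versions and report are hypothetical helpers introduced only for this illustration, and the prefix-based comparison rule is an assumption about what counts as a compatible version.

```python
import importlib
import sys

# Versions pinned in the setup steps above (assumed to be the ones you want).
EXPECTED = {"python": "3.6", "torch": "1.4.0", "torchvision": "0.5"}


def version_matches(installed, expected):
    """Return True if `installed` begins with every component of `expected`.

    For example, '0.5.0' matches '0.5', and '1.4.0+cu101' matches '1.4.0'.
    """
    if installed is None:
        return False
    # Drop local build tags such as '+cu101' before comparing components.
    installed_parts = installed.split("+")[0].split(".")
    expected_parts = expected.split(".")
    return installed_parts[: len(expected_parts)] == expected_parts


def installed_versions():
    """Collect the versions found in the current environment (None if missing)."""
    versions = {"python": "{}.{}".format(*sys.version_info[:2])}
    for name in ("torch", "torchvision"):
        try:
            versions[name] = importlib.import_module(name).__version__
        except (ImportError, OSError):
            # Package not installed, or its shared libraries failed to load.
            versions[name] = None
    return versions


def report(installed, expected=EXPECTED):
    """Build one human-readable status line per pinned package."""
    lines = []
    for name, want in expected.items():
        have = installed.get(name)
        status = "ok" if version_matches(have, want) else "MISMATCH"
        lines.append(
            "{:<12} expected {:<6} found {:<12} {}".format(name, want, str(have), status)
        )
    return lines


if __name__ == "__main__":
    print("\n".join(report(installed_versions())))
```

Run it inside the activated volta environment; any line ending in MISMATCH points to a package that differs from the pins above.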
Data
Check out data/README.md for links to preprocessed data and data preparation steps.
features_extraction/ contains our latest feature extraction steps, which store features in hdf5 and npy instead of csv and support different backbones. Steps for the IGLUE datasets can be found in its datasets sub-directory.
NB: I noticed that uploading the LMDB files made their size grow to the order of TBs. Instead, I recently uploaded the H5 versions, which can quickly be converted to LMDB locally using this script.
Models
Check out MODELS.md for links to pretrained models and how to define new ones in VOLTA.
Model configuration files are stored in config/.
Training and Evaluation
We provide sample scripts to train (i.e. pretrain or fine-tune) and evaluate models in examples/. These include ViLBERT, LXMERT and VL-BERT as detailed in the original papers, as well as ViLBERT, LXMERT, VL-BERT, VisualBERT and UNITER as specified in our controlled study.
Task configuration files are stored in config_tasks/.
License
This work is licensed under the MIT license. See LICENSE for details.
Third-party software and data sets are subject to their respective licenses.
If you find our code/data/models or ideas useful in your research, please consider citing the paper:
@article{bugliarello-etal-2021-multimodal,
author = {Bugliarello, Emanuele and Cotterell, Ryan and Okazaki, Naoaki and Elliott, Desmond},
title = "{Multimodal Pretraining Unmasked: {A} Meta-Analysis and a Unified Framework of Vision-and-Language {BERT}s}",
journal = {Transactions of the Association for Computational Linguistics},
volume = {9},
pages = {978-994},
year = {2021},
month = {09},
issn = {2307-387X},
doi = {10.1162/tacl_a_00408},
url = {https://doi.org/10.1162/tacl\_a\_00408},
}
Acknowledgement
Our codebase heavily relies on these excellent repositories: