X2-VLM, with its modular architecture, achieves the best performance at both base and large scale on image-text and video-text tasks, offering a good trade-off between performance and model size. The modular design also makes X2-VLM highly transferable to other languages and domains. For example, by simply replacing the text encoder with XLM-R, X2-VLM outperforms state-of-the-art multilingual multi-modal pre-trained models without any multilingual pre-training.
- Jun 2023: Released the official PyTorch implementation and checkpoints
- Nov 2022: Released the preprint on arXiv
- Support several backbones
- vision encoder: beit / clip-vit / swin-transformer
- text encoder: bert / roberta
- Support apex O1 / O2 for pre-training
- Read from and write to HDFS
- Distributed training across nodes for both pre-training and fine-tuning
Please read the code for more details.
- Set up the Python 3 environment
pip3 install -r requirements.txt
- Download the raw images from the corresponding websites
- Download the json files we provide, which contain image read paths, captions, and/or bbox annotations
- If running pre-training scripts:
- install Apex
- download pre-trained models for parameter initialization
- image encoder: beit2
- text encoder: bert
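Before launching a pre-training run, a quick sanity check of the environment can save time. The snippet below is only a sketch under assumptions: the checkpoint paths are placeholders for wherever you saved the BEiT-2 and BERT weights, not files shipped with this repo.

```python
# Hypothetical pre-flight check before launching pre-training.
# The checkpoint paths are placeholders (assumptions), not repo-provided paths.
import os
import torch

def preflight(vision_ckpt="data/beit2_base_patch16_224.pth",
              text_ckpt="data/bert-base-uncased/pytorch_model.bin"):
    assert torch.cuda.is_available(), "pre-training expects CUDA devices"
    try:
        from apex import amp  # noqa: F401  (needed for O1/O2 mixed precision)
    except ImportError:
        print("Apex not found: install it before running the pre-training scripts")
    for path in (vision_ckpt, text_ckpt):
        print(path, "found" if os.path.exists(path) else "MISSING")

if __name__ == "__main__":
    preflight()
```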
```bash
# X2-VLM pretrain
python3 run.py --task "pretrain_DIY" --dist "all" --config "configs/pretrain/x2vlm_base_4m.yaml" --output_dir "output/tmp"

# CCLM multilingual multimodal pretrain
python3 run.py --task "pretrain_DIY" --dist "all" --config "configs/pretrain/multilingual_cclm_x2vlm_base.yaml" --checkpoint "path/to/x2vlm_base_1b.th" --output_dir "output/tmp"
```
See run.py and configs/pretrain for more details.
All datasets we utilized are publicly available. Please prepare the pre-training data yourself. Read dataset/pretrain_dataset.py (more specifically ImageTextJsonDataset & RegionTextJsonDataset) to see what format is needed; a hypothetical example is sketched below.
The processed COCO & VG annotations can be downloaded here.
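For illustration only, here is a rough sketch of what json annotation lines might look like. The exact field names and the bounding-box convention are defined by ImageTextJsonDataset and RegionTextJsonDataset in dataset/pretrain_dataset.py, so treat the keys below as assumptions and follow the code.

```python
# Hypothetical illustration of image-text pre-training annotations as JSON Lines.
# Field names ("image", "caption", "bbox") and the [x, y, w, h] convention are
# assumptions; the authoritative format is in dataset/pretrain_dataset.py.
import json

examples = [
    {"image": "images/coco/train2014/COCO_train2014_000000000009.jpg",
     "caption": "a plate of food with broccoli and meat"},
    {"image": "images/vg/2337.jpg",
     "caption": "a man riding a horse",
     "bbox": [10, 20, 150, 200]},  # region-level annotation (assumed keys)
]

with open("example_annotations.json", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```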
Please make sure all parameters are loaded correctly.
X2VLM-base (4M)
X2VLM-large (4M)
X2VLM-base (1B)
X2VLM-large (1B)
CCLM-X2VLM-base
CCLM-X2VLM-large
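To verify that all parameters load correctly, a minimal check along the following lines can help. This is only a sketch: it assumes you construct the model yourself (e.g. following run.py), the optional "model" key wrapping is an assumption, and the checkpoint path is just an example.

```python
# Minimal sketch of verifying that a released checkpoint loads with no
# unexpected key mismatches. Constructing `model` is not shown here; see run.py.
import torch

def check_checkpoint(model, ckpt_path="x2vlm_ckpts_2release/x2vlm_base_1b.th"):
    state = torch.load(ckpt_path, map_location="cpu")
    # some checkpoints may wrap the weights, e.g. under a "model" key (assumption)
    if isinstance(state, dict):
        state = state.get("model", state)
    missing, unexpected = model.load_state_dict(state, strict=False)
    print("missing keys:", len(missing))
    print("unexpected keys:", len(unexpected))
```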
All datasets are publicly available. Some datasets can be downloaded here.
We have released all the code. However, we currently provide only some of the fine-tuned checkpoints (along with training configs and logs).
vqa-base
vqa-large
captioning-large
refcoco-bbox-large
Retrieving our previous training logs takes time. If you need more, please submit a GitHub issue and we will get back to you.
coco-retrieval-base-rerun
coco-retrieval-large-rerun
```bash
# train
python3 run.py --task "vqa" --dist "all" --config "configs/finetune/vqa2_large.yaml" --checkpoint "x2vlm_ckpts_2release/x2vlm_large_1b.th" --output_dir "output/tmp"

python3 run.py --task "refcoco_bbox" --dist "all" --config "configs/finetune/refcoco_grounding_large.yaml" --checkpoint "x2vlm_ckpts_2release/x2vlm_large_1b.th" --output_dir "output/tmp"

python3 run.py --task "coco_captioning_mlm" --dist "all" --config "configs/finetune/coco_captioning_large.yaml" --checkpoint "x2vlm_ckpts_2release/x2vlm_large_1b.th" --output_dir "output/tmp"
```
We release all training code. Specify "--task" and "--config" to fine-tune on other tasks. See run.py for details.
If you find this repository useful, please consider giving ⭐ or citing:
@article{zeng2022x,
title={X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks},
author={Zeng, Yan and Zhang, Xinsong and Li, Hang and Wang, Jiawei and Zhang, Jipeng and Zhou, Wangchunshu},
journal={arXiv preprint arXiv:2211.12402},
year={2022}
}
@article{zeng2022cross,
title={Cross-view language modeling: Towards unified cross-lingual cross-modal pre-training},
author={Zeng, Yan and Zhou, Wangchunshu and Luo, Ao and Zhang, Xinsong},
journal={arXiv preprint arXiv:2206.00621},
year={2022}
}
@article{xvlm,
title={Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts},
author={Zeng, Yan and Zhang, Xinsong and Li, Hang},
journal={arXiv preprint arXiv:2111.08276},
year={2021}
}
For issues using this code, please submit a GitHub issue.