Language Models Can See: Plugging Visual Controls in Text Generation
Authors: Yixuan Su, Tian Lan, Yahui Liu, Fangyu Liu, Dani Yogatama, Yan Wang, Lingpeng Kong, and Nigel Collier
This repository contains code, models, and other related resources for our paper [Language Models Can See: Plugging Visual Controls in Text Generation].
⭐ If you are also interested in open-ended text generation and would like to learn more about our contrastive search decoding method, please refer to our SimCTG [paper] and [repo].
⭐ Replicate has provided a great web [demo] of MAGIC that is super easy to use and interact with. Check it out!
Catalogue:
- 1. Introduction
- 2. News
- 3. Citation
- 4. Environment Setup
- 5. Zero-Shot Image Captioning
- 6. Visually Grounded Story Generation
- 7. Contact
- 8. MAGIC Elsewhere
1. Introduction:
Generative language models (LMs) such as GPT-2/3 can be prompted to generate text with remarkable quality. While they are designed for text-prompted generation, it remains an open question how the generation process could be guided by modalities beyond text, such as images. In this work, we propose a training-free framework, called MAGIC (iMAge-Guided text generatIon with CLIP), for plugging visual controls into the generation process, enabling LMs to perform multimodal tasks (e.g., image captioning) in a zero-shot manner. MAGIC is a simple yet efficient plug-and-play framework that directly combines an off-the-shelf LM (i.e., GPT-2) with an image-text matching model (i.e., CLIP) for image-grounded text generation. During decoding, MAGIC influences the generation of the LM by introducing a CLIP-induced score, called the magic score, which regularizes the generated result to be semantically related to a given image while remaining coherent with the previously generated context. Notably, the proposed decoding scheme does not involve any gradient update operation and is therefore computationally efficient. On the challenging task of zero-shot image captioning, MAGIC outperforms the state-of-the-art method by notable margins with a nearly 27 times decoding speedup. MAGIC is a flexible framework that is, in principle, compatible with any text generation task that incorporates image grounding. In our experiments, we show that it can also perform visually grounded story generation given both an image and a text prompt.
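To make the decoding scheme concrete, below is a minimal sketch of a MAGIC-style decoding step built on the Hugging Face transformers GPT-2 and CLIP checkpoints. The helper `magic_step`, the hyperparameters `k` and `beta`, the prompt, and the file `example.jpg` are illustrative assumptions, not the repository's exact implementation; in particular, the full method also subtracts the degeneration penalty from contrastive search (see SimCTG), which is omitted here for brevity.

```python
import torch
from PIL import Image
from transformers import GPT2LMHeadModel, GPT2Tokenizer, CLIPModel, CLIPProcessor

lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2Tokenizer.from_pretrained("gpt2")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def magic_step(prefix_ids, image, k=5, beta=2.0):
    """Pick the next token by combining GPT-2 confidence with a CLIP score."""
    # 1. Top-k next-token candidates from the frozen LM (no gradient updates).
    probs = torch.softmax(lm(prefix_ids).logits[0, -1], dim=-1)
    top_p, top_ids = probs.topk(k)

    # 2. "Magic score": CLIP image-text similarity of each candidate
    #    continuation, normalized over the k candidates.
    texts = [tok.decode(torch.cat([prefix_ids[0], t.view(1)])) for t in top_ids]
    inputs = proc(text=texts, images=image, return_tensors="pt", padding=True)
    magic = torch.softmax(clip(**inputs).logits_per_image[0], dim=-1)

    # 3. Select the candidate that balances fluency and image relevance.
    #    (The full method also subtracts contrastive search's degeneration
    #    penalty; omitted here for brevity.)
    best = (top_p + beta * magic).argmax()
    return torch.cat([prefix_ids, top_ids[best].view(1, 1)], dim=1)

# Usage: greedily grow a caption for an image, one magic step at a time.
ids = tok("Image of", return_tensors="pt").input_ids
image = Image.open("example.jpg")  # any local image file
for _ in range(15):
    ids = magic_step(ids, image)
print(tok.decode(ids[0]))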
2. News:
- [2022/05/06] MAGIC is publicly released!
3. Citation:
If you find our paper and resources useful, please give the repo a star and cite our papers. Thanks!
@article{su2022language,
  title={Language Models Can See: Plugging Visual Controls in Text Generation},
  author={Su, Yixuan and Lan, Tian and Liu, Yahui and Liu, Fangyu and Yogatama, Dani and Wang, Yan and Kong, Lingpeng and Collier, Nigel},
  journal={arXiv preprint arXiv:2205.02655},
  year={2022}
}

@article{su2022contrastive,
  title={A Contrastive Framework for Neural Text Generation},
  author={Su, Yixuan and Lan, Tian and Wang, Yan and Yogatama, Dani and Kong, Lingpeng and Collier, Nigel},
  journal={arXiv preprint arXiv:2202.06417},
  year={2022}
}
4. Environment Setup:
Python version: 3.8
Install the required packages with:
pip3 install -r requirements.txt
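After installation, a quick sanity check can confirm that the core dependencies import cleanly. This is a sketch that assumes torch and transformers are among the pinned requirements; the exact pinned versions live in requirements.txt.

```python
# Sanity check: verify the core dependencies resolved after pip install.
# (Assumes torch and transformers are in requirements.txt.)
import torch
import transformers

print("torch", torch.__version__, "| transformers", transformers.__version__)
```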