Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions. CVPR 2019

Show, Control and Tell

This repository contains the reference code for the paper Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions (CVPR 2019).

Please cite with the following BibTeX:

@inproceedings{cornia2019show,
  title={{Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions}},
  author={Cornia, Marcella and Baraldi, Lorenzo and Cucchiara, Rita},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2019}
}

Environment setup

Clone the repository and create the sct conda environment using the conda.yml file:

conda env create -f conda.yml
conda activate sct

Our code is based on SpeakSee, a Python package developed by us that provides utilities for working with Visual-Semantic data. The conda environment we provide already includes a beta version of this package.

Data preparation

COCO Entities

Download the annotations and metadata file dataset_coco.tgz (~85.6 MB) and extract it in the code folder using tar -xzvf dataset_coco.tgz.

Download the pre-computed features file coco_detections.hdf5 (~53.5 GB) and place it under the datasets/coco folder, which gets created after decompressing the annotation file.

Flickr30k Entities

As before, download the annotations and metadata file dataset_flickr.tgz (~32.8 MB) and extract it in the code folder using tar -xzvf dataset_flickr.tgz.

Download the pre-computed features file flickr30k_detections.hdf5 (~13.1 GB) and place it under the datasets/flickr folder, which gets created after decompressing the annotation file.
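Before launching evaluation or training, it can help to confirm that the feature files ended up where the steps above place them. The following is a minimal sketch and not part of the repository: the helper name missing_feature_files and its root argument are hypothetical, and only the two paths named above are checked.

```python
from pathlib import Path

# Relative locations taken from the data preparation steps above.
EXPECTED_FEATURE_FILES = {
    "coco": Path("datasets") / "coco" / "coco_detections.hdf5",
    "flickr": Path("datasets") / "flickr" / "flickr30k_detections.hdf5",
}


def missing_feature_files(root="."):
    """Return the dataset names whose feature file is absent under root.

    Hypothetical helper: root is assumed to be the code folder in which
    the .tgz archives were extracted.
    """
    root = Path(root)
    return [
        name
        for name, relative_path in EXPECTED_FEATURE_FILES.items()
        if not (root / relative_path).is_file()
    ]


if __name__ == "__main__":
    missing = missing_feature_files(".")
    if missing:
        print("Missing feature files for:", ", ".join(missing))
    else:
        print("All feature files found.")
```

Run it from the code folder; an empty result means both files are in place.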

Evaluation

To reproduce the results in the paper, download the pretrained model file saved_models.tgz (~4 GB) and extract it in the code folder with tar -xzvf saved_models.tgz.

Sequence controllability

Run python test_region_sequence.py using the following arguments:

Argument          Possible values
--dataset         coco, flickr
--exp_name        ours, ours_without_visual_sentinel, ours_with_single_sentinel
--sample_rl       If used, tests the model with CIDEr optimization
--sample_rl_nw    If used, tests the model with CIDEr + NW optimization
--batch_size      Batch size (default: 16)
--nb_workers      Number of workers (default: 0)

For example, to reproduce the results of our full model trained on COCO-Entities with CIDEr+NW optimization (Table 2, bottom right), use:

python test_region_sequence.py --dataset coco --exp_name ours --sample_rl_nw  

Set controllability

Run python test_region_set.py using the following arguments:

Argument          Possible values
--dataset         coco, flickr
--exp_name        ours, ours_without_visual_sentinel, ours_with_single_sentinel
--sample_rl       If used, tests the model with CIDEr optimization
--sample_rl_nw    If used, tests the model with CIDEr + NW optimization
--batch_size      Batch size (default: 16)
--nb_workers      Number of workers (default: 0)

For example, to reproduce the results of our full model trained on COCO-Entities with CIDEr+NW optimization (Table 4, bottom row), use:

python test_region_set.py --dataset coco --exp_name ours --sample_rl_nw  
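Both evaluation scripts accept the same arguments, so their command lines can be assembled by one small helper. The sketch below is illustrative only: build_eval_command and its rl_flag parameter are hypothetical names, and treating the absence of both --sample_rl and --sample_rl_nw as a valid configuration is an assumption. It does not check whether a pretrained model exists for a given combination.

```python
# Values copied from the argument tables above.
SCRIPTS = ("test_region_sequence.py", "test_region_set.py")
DATASETS = ("coco", "flickr")
EXP_NAMES = ("ours", "ours_without_visual_sentinel", "ours_with_single_sentinel")
RL_FLAGS = (None, "--sample_rl", "--sample_rl_nw")


def build_eval_command(script, dataset, exp_name, rl_flag=None,
                       batch_size=16, nb_workers=0):
    """Return the argument list for one evaluation run (hypothetical helper)."""
    if script not in SCRIPTS:
        raise ValueError(f"unknown script: {script}")
    if dataset not in DATASETS:
        raise ValueError(f"unknown dataset: {dataset}")
    if exp_name not in EXP_NAMES:
        raise ValueError(f"unknown experiment name: {exp_name}")
    if rl_flag not in RL_FLAGS:
        raise ValueError(f"unknown optimization flag: {rl_flag}")

    command = [
        "python", script,
        "--dataset", dataset,
        "--exp_name", exp_name,
        "--batch_size", str(batch_size),
        "--nb_workers", str(nb_workers),
    ]
    # The optimization flags take no value; append one only if requested.
    if rl_flag is not None:
        command.append(rl_flag)
    return command


if __name__ == "__main__":
    # Print the two example runs given above (full model, COCO, CIDEr + NW).
    for script in SCRIPTS:
        print(" ".join(build_eval_command(script, "coco", "ours", "--sample_rl_nw")))
```

The returned list can be passed to subprocess.run(command, check=True) from the code folder with the sct environment active.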

Expected output

Under logs/, you may also find the expected output of all experiments.

Training procedure

Run python train.py using the following arguments:

Argument          Possible values
--exp_name        Experiment name
--batch_size      Batch size (default: 100)
--lr              Initial learning rate (default: 5e-4)
--nb_workers      Number of workers (default: 0)
--sample_rl       If used, the model will be trained with CIDEr optimization
--sample_rl_nw    If used, the model will be trained with CIDEr + NW optimization

For example, to train the model with cross entropy, use:

python train.py --exp_name show_control_and_tell --batch_size 100 --lr 5e-4 

To train the model with CIDEr optimization (after training the model with cross entropy), use:

python train.py --exp_name show_control_and_tell --batch_size 100 --lr 5e-5 --sample_rl

To train the model with CIDEr + NW optimization (after training the model with cross entropy), use:

python train.py --exp_name show_control_and_tell --batch_size 100 --lr 5e-5 --sample_rl_nw
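The two-stage schedule described above (cross entropy first, then one of the two CIDEr-based optimizations at a lower learning rate) can be scripted as a short sequence. This is a sketch under assumptions: training_stages and run_stages are hypothetical helpers, and the sketch simply reuses the same --exp_name in both stages, as the example commands do.

```python
import subprocess


def training_stages(exp_name, rl_flag="--sample_rl_nw", batch_size=100):
    """Return the two train.py command lines, in the order they should run."""
    if rl_flag not in ("--sample_rl", "--sample_rl_nw"):
        raise ValueError(f"unknown optimization flag: {rl_flag}")

    base = [
        "python", "train.py",
        "--exp_name", exp_name,
        "--batch_size", str(batch_size),
    ]
    # Stage 1: cross-entropy training with the default learning rate.
    cross_entropy = base + ["--lr", "5e-4"]
    # Stage 2: CIDEr or CIDEr + NW optimization with the lower learning rate.
    optimization = base + ["--lr", "5e-5", rl_flag]
    return [cross_entropy, optimization]


def run_stages(stages):
    """Run each stage in order; check=True aborts on the first failure."""
    for command in stages:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in training_stages("show_control_and_tell"):
        print(" ".join(command))
```

Calling run_stages(training_stages("show_control_and_tell")) from the code folder would launch both stages back to back, which can take a long time.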

Note: the current training code only supports the use of the COCO Entities dataset.

COCO Entities

If you want to use only the annotations of our COCO Entities dataset, you can download the annotation file coco_entities_release.json (~403 MB).

The annotation file contains a python dictionary structured as follows:

coco_entities_release.json
 └── <id_image>
      └── <caption>
           └── 'det_sequences'
           └── 'noun_chunks'
           └── 'detections'
           └── 'split'

In detail, for each image-caption pair, we provide the following information:

  • det_sequences, which contains a list of detection classes associated with each word of the caption (for an exact match with caption words, split the caption by spaces). None indicates words that are not part of noun chunks, while _ indicates noun chunk words for which an association with a detection in the image was not possible.
  • noun_chunks, which is a list of tuples representing the noun chunks of the caption associated with a detection in the image. Each tuple is composed of two elements: the first is the noun chunk in the caption, while the second is the detection class associated with that noun chunk.
  • detections, which contains a dictionary with one entry for each detection class associated with at least one noun chunk in the caption. For each detection class, it provides a list of tuples representing the image regions detected by Faster R-CNN re-trained on Visual Genome [1] and corresponding to that detection class. Each tuple is composed of the detection id and the corresponding bounding box in the form [x1, y1, x2, y2]. The detection id can be used to recover the detection feature vector from the pre-computed features file coco_detections.hdf5 (~53.5 GB). See the demo section below for more details.
  • split, which indicates the dataset split of that sample (i.e. train, val or test) following the COCO splits provided by [2].
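The fields above can be walked with the standard json module. The following is a minimal sketch, not the authors' demo code: the function names are hypothetical, the toy entry uses made-up values rather than real dataset content, and it assumes that the caption string itself is the key under each image id and that tuples are stored as JSON lists.

```python
import json

# Toy entry with made-up values, shaped like the fields described above.
TOY_ANNOTATIONS = {
    "12345": {
        "a man riding a horse": {
            "det_sequences": ["man", "man", None, "horse", "horse"],
            "noun_chunks": [["a man", "man"], ["a horse", "horse"]],
            "detections": {
                "man": [[3, [10.0, 20.0, 110.0, 220.0]]],
                "horse": [[7, [40.0, 90.0, 300.0, 310.0]]],
            },
            "split": "train",
        }
    }
}


def load_annotations(path):
    """Load the annotation file (for example coco_entities_release.json)."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def align_words(caption, entry):
    """Pair each space-separated caption word with its detection class."""
    words = caption.split(" ")
    classes = entry["det_sequences"]
    if len(words) != len(classes):
        raise ValueError("caption and det_sequences differ in length")
    return list(zip(words, classes))


def regions_for_chunks(entry):
    """List (noun chunk, class, [(detection id, box), ...]) for each chunk."""
    result = []
    for chunk_text, det_class in entry["noun_chunks"]:
        regions = [
            (det_id, tuple(box))
            for det_id, box in entry["detections"].get(det_class, [])
        ]
        result.append((chunk_text, det_class, regions))
    return result


if __name__ == "__main__":
    for image_id, captions in TOY_ANNOTATIONS.items():
        for caption, entry in captions.items():
            print(image_id, entry["split"], align_words(caption, entry))
            print(regions_for_chunks(entry))
```

With the real file, replace TOY_ANNOTATIONS with load_annotations("coco_entities_release.json"); the file is large, so loading it takes a noticeable amount of memory.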

Note that this annotation file includes all image-caption pairs for which at least one noun chunk-detection association has been found. However, in the validation and testing phases of our controllable captioning model, we dropped all captions with empty region sets (i.e. those captions with at least one _ in the det_sequences field).
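The filtering criterion stated above can be expressed in a few lines. This is a sketch of that criterion only, not the code used for the paper: the function names are hypothetical, and the toy entries are made up and reduced to the two fields the filter needs.

```python
# Made-up entries, reduced to the two fields used by the filter.
TOY_SPLIT_ANNOTATIONS = {
    "1": {
        "a dog on a couch": {
            "det_sequences": ["dog", "dog", None, "couch", "couch"],
            "split": "val",
        },
        "a dog near a window": {
            # "_" marks noun chunk words with no associated detection.
            "det_sequences": ["dog", "dog", None, "_", "_"],
            "split": "val",
        },
    },
    "2": {
        "a red bus": {
            "det_sequences": ["bus", "bus", "bus"],
            "split": "test",
        },
    },
}


def has_unmatched_chunk(entry):
    """True if at least one word is marked with "_" in det_sequences."""
    return "_" in entry["det_sequences"]


def evaluation_pairs(annotations, split):
    """Return (image id, caption) pairs of a split, dropping "_" captions."""
    pairs = []
    for image_id, captions in annotations.items():
        for caption, entry in captions.items():
            if entry["split"] == split and not has_unmatched_chunk(entry):
                pairs.append((image_id, caption))
    return pairs


if __name__ == "__main__":
    print(evaluation_pairs(TOY_SPLIT_ANNOTATIONS, "val"))
```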

By downloading the dataset, you declare that you will use it for research and educational purposes only; any commercial use is prohibited.

Demo

An example of how to use the COCO Entities annotations can be found in the coco_entities_demo.ipynb file.

References

[1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[2] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.

Contact

If you have any general questions about our work, please use the public issues section of this GitHub repo. Alternatively, drop us an e-mail at marcella.cornia [at] unimore.it or lorenzo.baraldi [at] unimore.it.
