Supplementary material to "Top-down Visual Saliency Guided by Captions" (CVPR 2017)

Caption-Guided Saliency

This code is released as a supplementary material to "Top-down Visual Saliency Guided by Captions" (CVPR 2017).

Getting started

Clone this repo (including coco-caption as a submodule):

$ git clone --recursive git@github.com:VisionLearningGroup/caption-guided-saliency.git

Install dependencies

The model is implemented in TensorFlow and runs on Python 2.7. For TensorFlow installation, please refer to the official Installing TensorFlow guide, or simply run:

$ pip install --upgrade https://storage.googleapis.com/tensorflow/linux/gpu/tensorflow_gpu-1.1.0-cp27-none-linux_x86_64.whl

Warning! The standard TensorFlow build prints warnings such as:

The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.

This is harmless. To get rid of these warnings, you need to build TensorFlow from source with --config=opt.

Install the other required Python modules:

$ pip install tqdm numpy six pillow matplotlib scipy

The code also uses ffmpeg for data preprocessing.
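
Before running preprocessing, it can help to confirm that everything listed above is actually available. The following is a minimal sketch, not part of the released code: the helper names `missing_modules`, `has_ffmpeg`, and `report` are hypothetical, and the module list simply mirrors the pip packages named above (note that the `pillow` package is imported as `PIL`).

```python
import importlib

try:
    from shutil import which  # Python 3.3+
except ImportError:
    # Fallback for Python 2.7, which lacks shutil.which
    from distutils.spawn import find_executable as which

# Import names for the packages listed in this README (pillow -> PIL).
REQUIRED_MODULES = ["tensorflow", "tqdm", "numpy", "six", "PIL", "matplotlib", "scipy"]


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


def has_ffmpeg():
    """Return True if an ffmpeg executable is found on PATH."""
    return which("ffmpeg") is not None


def report():
    """Print a short summary of what is missing, if anything."""
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing Python modules: " + ", ".join(missing))
    if not has_ffmpeg():
        print("ffmpeg was not found on PATH")
    if not missing and has_ffmpeg():
        print("All dependencies look available")
```

Calling `report()` from the same Python environment you plan to train in prints which pieces, if any, still need to be installed.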

Obtain the dataset you need (MSR-VTT or Flickr30k) and unpack the files into their respective directories under ./DATA/.

The expected layout so far is:

./DATA/
    └───MSR_VTT/
    │   │   test_videodatainfo.json
    │   │   train_val_videodatainfo.json
    │   │
    │   └───TestVideo/
    │   │       ...
    │   │
    │   └───TrainValVideo/
    │           ...
    └───Flickr30k
        │   results_20130124.token
        │
        └───flickr30k-images/
                ...
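
The layout above can be checked programmatically before starting the long preprocessing step. This is a small sketch under the assumption that the paths are exactly those shown in the tree; the `EXPECTED` table and the `missing_paths` helper are illustrative names, not part of the repository.

```python
import os

# Paths relative to the data root, taken from the layout shown above.
EXPECTED = {
    "MSR-VTT": [
        "MSR_VTT/test_videodatainfo.json",
        "MSR_VTT/train_val_videodatainfo.json",
        "MSR_VTT/TestVideo",
        "MSR_VTT/TrainValVideo",
    ],
    "Flickr30k": [
        "Flickr30k/results_20130124.token",
        "Flickr30k/flickr30k-images",
    ],
}


def missing_paths(dataset, data_root="./DATA"):
    """Return the expected files/directories that do not exist yet."""
    missing = []
    for rel in EXPECTED[dataset]:
        full = os.path.join(data_root, *rel.split("/"))
        if not os.path.exists(full):
            missing.append(rel)
    return missing
```

For example, `missing_paths("Flickr30k")` returns an empty list once both the token file and the image directory are in place.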

Run data preprocessing

$ python preprocessing.py --dataset {MSR-VTT|Flickr30k}

This step takes about 30 minutes for Flickr30k and about 2 hours for MSR-VTT.

Run training

$ python run_s2vt.py --dataset {MSR-VTT|Flickr30k} --train

We do not fine-tune the CNN part of the model, so training on a GPU takes only several hours. The training/validation/test splits for Flickr30k are taken from NeuralTalk. After training, you can evaluate the model:

$ python run_s2vt.py --dataset {MSR-VTT|Flickr30k} --test --checkpoint {number}

Saliency Visualization

Once you have a model trained to produce captions for the MSR-VTT dataset, you can generate a video with a saliency visualization similar to those at the beginning of the readme:

$ python visualization.py --dataset MSR-VTT     \
                          --media_id video9461  \
                          --checkpoint {number} \
                          --sentence "A man is driving a car"

where media_id must belong to the test split of MSR-VTT and sentence sets the query phrase.
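
To confirm that a media_id is in the test split before launching the visualization, one option is to look it up in test_videodatainfo.json. The sketch below assumes that file contains a top-level "videos" list whose entries carry a "video_id" field; that structure is an assumption and is not stated in this README, and the helper names are hypothetical.

```python
import json


def video_ids_in(info):
    """Collect video ids from a parsed annotation dict.

    Assumes info["videos"] is a list of dicts with a "video_id" key.
    """
    return {video["video_id"] for video in info.get("videos", [])}


def is_test_video(media_id, info):
    """Return True if media_id appears among the ids in the parsed file."""
    return media_id in video_ids_in(info)


def load_info(path="./DATA/MSR_VTT/test_videodatainfo.json"):
    """Load the test annotation file from the layout described above."""
    with open(path) as f:
        return json.load(f)
```

With the data in place, `is_test_video("video9461", load_info())` indicates whether the id used in the command above is a valid choice.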

What's next

You can change the model's parameters (layer dimensionality, learning rate, etc.) directly in cfg.py. Every run of run_s2vt.py with the --train switch overwrites the files in the experiments directory.
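
Because a new training run overwrites earlier results, you may want to copy the experiments directory aside first. This is a minimal sketch: the default paths and the `backup_experiments` helper are assumptions for illustration, so adjust `src` to wherever the experiments directory lives in your checkout.

```python
import os
import shutil
import time


def backup_experiments(src="experiments", backup_root="experiments_backup"):
    """Copy src into a timestamped folder under backup_root.

    Returns the backup path, or None if src does not exist.
    Two calls within the same second would collide on the folder name.
    """
    if not os.path.isdir(src):
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dst = os.path.join(backup_root, stamp)
    # copytree creates dst (and any missing parents) itself.
    shutil.copytree(src, dst)
    return dst
```

Running `backup_experiments()` before each `--train` invocation keeps earlier checkpoints available for later evaluation or visualization.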

References

If you find this useful in your work, please consider citing:

@inproceedings{Ramanishka2017cvpr,
    title = {Top-down Visual Saliency Guided by Captions},
    author = {Vasili Ramanishka and Abir Das and Jianming Zhang and Kate Saenko},
    booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
    year = {2017}
}
