Code release for Hu et al. Learning to Reason: End-to-End Module Networks for Visual Question Answering. in ICCV, 2017

Learning to Reason: End-to-End Module Networks for Visual Question Answering

This repository contains the code for the following paper:

  • R. Hu, J. Andreas, M. Rohrbach, T. Darrell, K. Saenko, Learning to Reason: End-to-End Module Networks for Visual Question Answering. in ICCV, 2017. (PDF)
@inproceedings{hu2017learning,
  title={Learning to Reason: End-to-End Module Networks for Visual Question Answering},
  author={Hu, Ronghang and Andreas, Jacob and Rohrbach, Marcus and Darrell, Trevor and Saenko, Kate},
  booktitle={Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
  year={2017}
}

Project Page: http://ronghanghu.com/n2nmn

Installation

  1. Install Python 3 (Anaconda recommended: https://www.continuum.io/downloads).
  2. Install TensorFlow v1.0.0 (Note: newer or older versions of TensorFlow may fail to work due to incompatibility with TensorFlow Fold):
    pip install tensorflow-gpu==1.0.0
  3. Install TensorFlow Fold (needed to run dynamic graphs):
    pip install https://storage.googleapis.com/tensorflow_fold/tensorflow_fold-0.0.1-py3-none-linux_x86_64.whl
  4. Download this repository or clone with Git, and then enter the root directory of the repository:
    git clone https://github.com/ronghanghu/n2nmn.git && cd n2nmn

Train and evaluate on the CLEVR dataset

Download and preprocess the data

  1. Download the CLEVR dataset from http://cs.stanford.edu/people/jcjohns/clevr/, and symlink it to exp_clevr/clevr-dataset. After this step, the file structure should look like
exp_clevr/clevr-dataset/
  images/
    train/
      CLEVR_train_000000.png
      ...
    val/
    test/
  questions/
    CLEVR_train_questions.json
    CLEVR_val_questions.json
    CLEVR_test_questions.json
  ...
  2. Extract visual features from the images and store them on disk. In our experiments, we keep the original 480 x 320 image size in CLEVR and use the pool5 layer output of shape (1, 10, 15, 512) from the VGG-16 network (features stored as numpy arrays in HxWxC format). Then construct the "expert layout" from the ground-truth functional programs, and build image collections (imdb) for CLEVR. These steps can be done as follows.
./exp_clevr/tfmodel/vgg_net/download_vgg_net.sh  # VGG-16 converted to TF

cd ./exp_clevr/data/
python extract_visual_features_vgg_pool5.py  # feature extraction
python get_ground_truth_layout.py  # construct expert policy
python build_clevr_imdb.py  # build image collections
cd ../../

The saved features will take up approximately 29GB disk space (for all images in CLEVR train, val and test).
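
As a rough sanity check on the 29GB figure, the total can be estimated from the feature shape alone. The sketch below assumes float32 features and the standard CLEVR v1.0 split sizes (70,000 train, 15,000 val and 15,000 test images), neither of which is stated in this README, and it ignores the small per-file .npy header.

```python
# Rough estimate of the disk space taken by the CLEVR pool5 features.
# Assumptions (not stated in this README): features are float32 and the
# splits have the standard CLEVR v1.0 sizes.
FEATURE_SHAPE = (1, 10, 15, 512)   # pool5 output for a 480 x 320 image
BYTES_PER_VALUE = 4                # float32 (assumed)
NUM_IMAGES = {"train": 70000, "val": 15000, "test": 15000}  # assumed


def feature_bytes(shape, bytes_per_value):
    """Return the raw size in bytes of one feature array."""
    count = 1
    for dim in shape:
        count *= dim
    return count * bytes_per_value


per_image = feature_bytes(FEATURE_SHAPE, BYTES_PER_VALUE)
total = per_image * sum(NUM_IMAGES.values())
total_gib = total / 2 ** 30

print("%d bytes per image, %.1f GiB in total" % (per_image, total_gib))
```

Under these assumptions the estimate comes to roughly 28.6 GiB, which is consistent with the figure quoted above.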

Training

  1. Add the root of this repository to PYTHONPATH: export PYTHONPATH=.:$PYTHONPATH

  2. Train with ground-truth layout (cloning expert + policy search after cloning)

    • Step a (cloning expert):
      python exp_clevr/train_clevr_gt_layout.py
    • Step b (policy search after cloning):
      python exp_clevr/train_clevr_rl_gt_layout.py
      which is by default initialized from exp_clevr/tfmodel/clevr_gt_layout/00050000 (the 50000-iteration snapshot in Step a). If you want to initialize from another snapshot, use the --pretrained_model flag to specify the snapshot path.
  3. Train without ground-truth layout (policy search from scratch)
    python exp_clevr/train_clevr_scratch.py

Test

  1. Add the root of this repository to PYTHONPATH: export PYTHONPATH=.:$PYTHONPATH

  2. Evaluate clevr_gt_layout (cloning expert):
    python exp_clevr/eval_clevr.py --exp_name clevr_gt_layout --snapshot_name 00050000 --test_split val
    Expected accuracy: 78.9% (on val split).

  3. Evaluate clevr_rl_gt_layout (policy search after cloning):
    python exp_clevr/eval_clevr.py --exp_name clevr_rl_gt_layout --snapshot_name 00050000 --test_split val
    Expected accuracy: 83.6% (on val split).

  4. Evaluate clevr_scratch (policy search from scratch):
    python exp_clevr/eval_clevr.py --exp_name train_clevr_scratch --snapshot_name 00100000 --test_split val
    Expected accuracy: 69.1% (on val split).

Note:

  • The evaluation scripts above print the accuracy (for the val split only) and save it under exp_clevr/results/. They also save a prediction output file under exp_clevr/eval_outputs/.
  • By default, the scripts use GPU 0 and evaluate on the validation split of CLEVR. To evaluate on a different GPU, set the --gpu_id flag.
  • To evaluate on the test split, use --test_split tst instead. As there are no ground-truth answers for the test split in the downloaded CLEVR data, the evaluation script will print zero accuracy on the test split. You may email the prediction outputs in exp_clevr/eval_outputs/ to the CLEVR dataset authors to obtain the test split accuracy.
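
When evaluating several snapshots or splits, it can be convenient to assemble the command line programmatically. The sketch below is a hypothetical helper, not part of this repository: `build_eval_command` is an illustrative name, and it assumes that `--gpu_id` takes the GPU index as its value. It uses only the flags mentioned above.

```python
import shlex


def build_eval_command(exp_name, snapshot_name, test_split="val", gpu_id=None):
    """Build the argument list for exp_clevr/eval_clevr.py (hypothetical helper)."""
    cmd = [
        "python", "exp_clevr/eval_clevr.py",
        "--exp_name", exp_name,
        "--snapshot_name", snapshot_name,
        "--test_split", test_split,
    ]
    if gpu_id is not None:
        # Assumption: --gpu_id expects the integer index of the GPU.
        cmd += ["--gpu_id", str(gpu_id)]
    return cmd


cmd = build_eval_command("clevr_rl_gt_layout", "00050000", gpu_id=1)
print(" ".join(shlex.quote(arg) for arg in cmd))
# To actually run it from the repository root, with PYTHONPATH set as above:
#   import subprocess; subprocess.check_call(cmd)
```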

Train and evaluate on the VQA dataset

Download and preprocess the data

  1. Download the VQA dataset annotations from http://www.visualqa.org/download.html, and symlink them to exp_vqa/vqa-dataset. After this step, the file structure should look like
exp_vqa/vqa-dataset/
  Questions/
    OpenEnded_mscoco_train2014_questions.json
    OpenEnded_mscoco_val2014_questions.json
    OpenEnded_mscoco_test-dev2015_questions.json
    OpenEnded_mscoco_test2015_questions.json
  Annotations/
    mscoco_train2014_annotations.json
    mscoco_val2014_annotations.json
  2. Download the COCO images from http://mscoco.org/, extract features from the images, and store them under exp_vqa/data/resnet_res5c/. In our experiments, we resize all the COCO images to 448 x 448 and use the res5c layer output of shape (1, 14, 14, 2048) from the ResNet-152 network pretrained on ImageNet classification (features stored as numpy arrays in HxWxC format). These are the same ResNet-152 res5c features as in MCB, except that they are stored in NHWC format (instead of the NCHW format used in MCB).

The saved features will take up approximately 307GB disk space (for all images in COCO train2014, val2014 and test2015). After feature extraction, the file structure for the features should look like

exp_vqa/data/resnet_res5c/
  train2014/
    COCO_train2014_000000000009.npy
    ...
  val2014/
    COCO_val2014_000000000042.npy
    ...
  test2015/
    COCO_test2015_000000000001.npy
    ...

where each *.npy file contains the COCO image feature extracted from the res5c layer of the ResNet-152 network: a float32 numpy array of shape (1, 14, 14, 2048), stored in HxWxC format.
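
If features have already been extracted in NCHW layout (as in MCB), they need to be reordered to NHWC before being saved in the structure above. The following is a minimal sketch of that reordering with numpy, using a dummy array in place of a real feature map; `nchw_to_nhwc` is an illustrative helper and not a function from this repository.

```python
import numpy as np


def nchw_to_nhwc(features):
    """Reorder a (N, C, H, W) array to (N, H, W, C)."""
    if features.ndim != 4:
        raise ValueError("expected a 4-D array, got shape %s" % (features.shape,))
    # ascontiguousarray makes a real copy in the new order rather than a strided view
    return np.ascontiguousarray(np.transpose(features, (0, 2, 3, 1)))


# Dummy stand-in for one res5c feature map in NCHW layout.
nchw = np.zeros((1, 2048, 14, 14), dtype=np.float32)
nhwc = nchw_to_nhwc(nchw)
print(nhwc.shape, nhwc.dtype)
# np.save("COCO_train2014_000000000009.npy", nhwc)  # file name as in the tree above
```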

  3. Build image collections (imdb) for VQA:
cd ./exp_vqa/data/
python build_vqa_imdb.py
cd ../../
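
Before running build_vqa_imdb.py, it can help to confirm that the question and annotation files are where the file structure above expects them. The sketch below is a hypothetical check, not part of this repository; it assumes it is run from the repository root, and `find_missing` is an illustrative name.

```python
import os

VQA_ROOT = "exp_vqa/vqa-dataset"
# File names taken from the expected file structure listed above.
EXPECTED_FILES = [
    "Questions/OpenEnded_mscoco_train2014_questions.json",
    "Questions/OpenEnded_mscoco_val2014_questions.json",
    "Questions/OpenEnded_mscoco_test-dev2015_questions.json",
    "Questions/OpenEnded_mscoco_test2015_questions.json",
    "Annotations/mscoco_train2014_annotations.json",
    "Annotations/mscoco_val2014_annotations.json",
]


def find_missing(root, relative_paths):
    """Return the entries of relative_paths that are not files under root."""
    # os.path.isfile follows symlinks, so a symlinked dataset directory works too
    return [p for p in relative_paths
            if not os.path.isfile(os.path.join(root, p))]


missing = find_missing(VQA_ROOT, EXPECTED_FILES)
if missing:
    print("Missing under %s:" % VQA_ROOT)
    for p in missing:
        print("  " + p)
else:
    print("All expected VQA files found.")
```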

Note: this repository already contains the parsing results from Stanford Parser for the VQA questions under exp_vqa/data/parse/new_parse (parsed using this script), with the converted ground-truth (expert) layouts under exp_vqa/data/gt_layout_*_new_parse.npy (converted using notebook exp_vqa/data/convert_new_parse_to_gt_layout.ipynb).

Training

Train with ground-truth layout:

  1. Add the root of this repository to PYTHONPATH: export PYTHONPATH=.:$PYTHONPATH
  2. Step a (cloning expert):
    python exp_vqa/train_vqa_gt_layout.py
  3. Step b (policy search after cloning):
    python exp_vqa/train_vqa_rl_gt_layout.py

Test

  1. Add the root of this repository to PYTHONPATH: export PYTHONPATH=.:$PYTHONPATH

  2. Evaluate on vqa_gt_layout (cloning expert):

    • (on test-dev2015 split):
      python exp_vqa/eval_vqa.py --exp_name vqa_gt_layout --snapshot_name 00040000 --test_split test-dev2015
    • (on test2015 split):
      python exp_vqa/eval_vqa.py --exp_name vqa_gt_layout --snapshot_name 00040000 --test_split test2015
  3. Evaluate on vqa_rl_gt_layout (policy search after cloning):

    • (on test-dev2015 split):
      python exp_vqa/eval_vqa.py --exp_name vqa_rl_gt_layout --snapshot_name 00040000 --test_split test-dev2015
    • (on test2015 split):
      python exp_vqa/eval_vqa.py --exp_name vqa_rl_gt_layout --snapshot_name 00040000 --test_split test2015

Note: the above evaluation scripts do not print the accuracy; they write the prediction outputs to exp_vqa/eval_outputs/, which can be uploaded to the evaluation server (http://www.visualqa.org/roe.html) for evaluation. The expected accuracy of vqa_rl_gt_layout on the test-dev2015 split is 64.9%.

Train and evaluate on the VQAv2 dataset

Download and preprocess the data

  1. Download the VQAv2 dataset annotations from http://www.visualqa.org/download.html, and symlink them to exp_vqa/vqa-dataset. After this step, the file structure should look like
exp_vqa/vqa-dataset/
  Questions/
    v2_OpenEnded_mscoco_train2014_questions.json
    v2_OpenEnded_mscoco_val2014_questions.json
    v2_OpenEnded_mscoco_test-dev2015_questions.json
    v2_OpenEnded_mscoco_test2015_questions.json
  Annotations/
    v2_mscoco_train2014_annotations.json
    v2_mscoco_val2014_annotations.json
    v2_mscoco_train2014_complementary_pairs.json
    v2_mscoco_val2014_complementary_pairs.json
  2. Download the COCO images from http://mscoco.org/, extract features from the images, and store them under exp_vqa/data/resnet_res5c/. In our experiments, we resize all the COCO images to 448 x 448 and use the res5c layer output of shape (1, 14, 14, 2048) from the ResNet-152 network pretrained on ImageNet classification (features stored as numpy arrays in HxWxC format). These are the same ResNet-152 res5c features as in MCB, except that they are stored in NHWC format (instead of the NCHW format used in MCB).

The saved features will take up approximately 307GB disk space (for all images in COCO train2014, val2014 and test2015). After feature extraction, the file structure for the features should look like

exp_vqa/data/resnet_res5c/
  train2014/
    COCO_train2014_000000000009.npy
    ...
  val2014/
    COCO_val2014_000000000042.npy
    ...
  test2015/
    COCO_test2015_000000000001.npy
    ...

where each *.npy file contains the COCO image feature extracted from the res5c layer of the ResNet-152 network: a float32 numpy array of shape (1, 14, 14, 2048), stored in HxWxC format.
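
Since extracting the features is a long job, it may be worth spot-checking a few saved files against this description before building the imdb. The sketch below is a hypothetical check, not part of this repository: `check_feature_file` is an illustrative name, and the glob pattern assumes the script is run from the repository root.

```python
import glob

import numpy as np

EXPECTED_SHAPE = (1, 14, 14, 2048)
EXPECTED_DTYPE = np.float32


def check_feature_file(path):
    """Return a list of problems found in one saved feature file (empty if OK)."""
    # mmap_mode="r" avoids reading the whole array into memory
    feature = np.load(path, mmap_mode="r")
    problems = []
    if feature.shape != EXPECTED_SHAPE:
        problems.append("shape %s != %s" % (feature.shape, EXPECTED_SHAPE))
    if feature.dtype != EXPECTED_DTYPE:
        problems.append("dtype %s != float32" % feature.dtype)
    return problems


# Spot-check the first few validation features; prints nothing if the
# directory does not exist yet.
for path in sorted(glob.glob("exp_vqa/data/resnet_res5c/val2014/*.npy"))[:10]:
    print(path, check_feature_file(path) or "OK")
```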

  3. Build image collections (imdb) for VQAv2:
cd ./exp_vqa/data/
python build_vqa_v2_imdb.py
cd ../../

Note: this repository already contains the parsing results from Stanford Parser for the VQAv2 questions under exp_vqa/data/parse/new_parse_vqa_v2 (parsed using this script), with the converted ground-truth (expert) layouts under exp_vqa/data/v2_gt_layout_*_new_parse.npy.

Training

Train with ground-truth layout:

  1. Add the root of this repository to PYTHONPATH: export PYTHONPATH=.:$PYTHONPATH
  2. Step a (cloning expert):
    python exp_vqa/train_vqa2_gt_layout.py
  3. Step b (policy search after cloning):
    python exp_vqa/train_vqa2_rl_gt_layout.py

Test

  1. Add the root of this repository to PYTHONPATH: export PYTHONPATH=.:$PYTHONPATH

  2. Evaluate on vqa2_gt_layout (cloning expert):

    • (on test-dev2015 split):
      python exp_vqa/eval_vqa2.py --exp_name vqa2_gt_layout --snapshot_name 00080000 --test_split test-dev2015
    • (on test2015 split):
      python exp_vqa/eval_vqa2.py --exp_name vqa2_gt_layout --snapshot_name 00080000 --test_split test2015
  3. Evaluate on vqa2_rl_gt_layout (policy search after cloning):

    • (on test-dev2015 split):
      python exp_vqa/eval_vqa2.py --exp_name vqa2_rl_gt_layout --snapshot_name 00080000 --test_split test-dev2015
    • (on test2015 split):
      python exp_vqa/eval_vqa2.py --exp_name vqa2_rl_gt_layout --snapshot_name 00080000 --test_split test2015

Note: the above evaluation scripts do not print the accuracy; they write the prediction outputs to exp_vqa/eval_outputs/, which can be uploaded to the evaluation server (http://www.visualqa.org/roe.html) for evaluation. The expected accuracy of vqa2_rl_gt_layout on the test-dev2015 split is 63.3%.

Train and evaluate on the SHAPES dataset

A copy of the SHAPES dataset is contained in this repository under exp_shapes/shapes_dataset. The ground-truth module layouts (expert layouts) we use in our experiments are also provided under exp_shapes/data/*_symbols.json. The script to obtain the expert layouts from the annotations is in exp_shapes/data/get_ground_truth_layout.ipynb.

Training

  1. Add the root of this repository to PYTHONPATH: export PYTHONPATH=.:$PYTHONPATH

  2. Train with ground-truth layout (behavioral cloning from expert):
    python exp_shapes/train_shapes_gt_layout.py

  3. Train without ground-truth layout (policy search from scratch):
    python exp_shapes/train_shapes_scratch.py

Note: by default, the above scripts use GPU 0. To train on a different GPU, set the --gpu_id flag. During training, the script will write TensorBoard events to exp_shapes/tb/ and save the snapshots under exp_shapes/tfmodel/.

Test

  1. Add the root of this repository to PYTHONPATH: export PYTHONPATH=.:$PYTHONPATH

  2. Evaluate shapes_gt_layout (behavioral cloning from expert):
    python exp_shapes/eval_shapes.py --exp_name shapes_gt_layout --snapshot_name 00040000 --test_split test

  3. Evaluate shapes_scratch (policy search from scratch):
    python exp_shapes/eval_shapes.py --exp_name shapes_scratch --snapshot_name 00400000 --test_split test

Note: the above evaluation scripts will print out the accuracy and also save it under exp_shapes/results/. By default, the above scripts use GPU 0, and evaluate on the test split of SHAPES. To evaluate on a different GPU, set the --gpu_id flag. To evaluate on the validation split, use --test_split val instead.
