Speaker-Follower Models for Vision-and-Language Navigation

This repository contains the code for the following paper:

  • D. Fried*, R. Hu*, V. Cirik*, A. Rohrbach, J. Andreas, L.-P. Morency, T. Berg-Kirkpatrick, K. Saenko, D. Klein**, T. Darrell**, Speaker-Follower Models for Vision-and-Language Navigation. in NeurIPS, 2018. (PDF)
@inproceedings{fried2018speaker,
  title={Speaker-Follower Models for Vision-and-Language Navigation},
  author={Fried, Daniel and Hu, Ronghang and Cirik, Volkan and Rohrbach, Anna and Andreas, Jacob and Morency, Louis-Philippe and Berg-Kirkpatrick, Taylor and Saenko, Kate and Klein, Dan and Darrell, Trevor},
  booktitle={Neural Information Processing Systems (NeurIPS)},
  year={2018}
}

(*, **: indicate equal contribution)

Project Page: http://ronghanghu.com/speaker_follower

Augmented Data on R2R

If you only want to use our data augmentation on the R2R dataset but don't need our models, you can directly download our augmented data on R2R (JSON file containing synthetic data generated by our speaker model) here. This JSON file is in the same format as the original R2R dataset, with one synthetic instruction per sampled new trajectory.

Note that we first trained on the combination of the original and the augmented data, and then fine-tuned on the original training data.
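
For reference, the sketch below (not part of the release) shows one way to load and inspect this file; the field names (scan, path, heading, instructions) are assumptions based on the original R2R schema.

# Hedged example: inspect the augmented R2R-style JSON file.
# The field names follow the original R2R schema and are assumptions here.
import json

with open("tasks/R2R/data/R2R_literal_speaker_data_augmentation_paths.json") as f:
    augmented = json.load(f)

print(len(augmented), "synthetic trajectory/instruction pairs")
sample = augmented[0]
print("scan:", sample.get("scan"))
print("path length (viewpoints):", len(sample.get("path", [])))
print("instruction:", sample.get("instructions", [""])[0])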

Installation

  1. Install Python 3 (Anaconda recommended: https://www.continuum.io/downloads).
  2. Install PyTorch following the instructions on https://pytorch.org/ (we used PyTorch 0.3.1 in our experiments).
  3. Download this repository or clone recursively with Git, and then enter the root directory of the repository:
# Make sure to clone with --recursive
git clone --recursive https://github.com/ronghanghu/speaker_follower.git
cd speaker_follower

If you didn't clone with the --recursive flag, then you'll need to manually clone the pybind submodule from the top-level directory:

git submodule update --init --recursive
  4. Install the dependencies for the Matterport3D Simulator:
sudo apt-get install libopencv-dev python-opencv freeglut3 freeglut3-dev libglm-dev libjsoncpp-dev doxygen libosmesa6-dev libosmesa6 libglew-dev
  5. Compile the Matterport3D Simulator:
mkdir build && cd build
cmake ..
make
cd ../

Note: This repository is built upon the Matterport3DSimulator codebase. Additional details on the Matterport3D Simulator can be found in README_Matterport3DSimulator.md.

Train and evaluate on the Room-to-Room (R2R) dataset

Download and preprocess the data

  1. Download the precomputed ResNet image features and extract them into img_features/:
mkdir -p img_features/
cd img_features/
wget https://www.dropbox.com/s/o57kxh2mn5rkx4o/ResNet-152-imagenet.zip?dl=1 -O ResNet-152-imagenet.zip
unzip ResNet-152-imagenet.zip
cd ..

(In case the URL above doesn't work, it is likely because the Room-to-Room dataset changed its feature URLs. You can find the latest download links here.)

After this step, img_features/ should contain ResNet-152-imagenet.tsv. (Note that you only need to download the features extracted from ImageNet-pretrained ResNet to run the following experiments. Places-pretrained ResNet features or actual images are not required.)
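
For reference, the sketch below is one way to read the feature file in Python. It assumes the Matterport3D Simulator's precomputed-feature format (tab-separated rows with base64-encoded float32 features, 36 discretized views of 2048 dimensions per viewpoint); verify the field list against the codebase before relying on it.

# Hedged sketch: load ResNet-152-imagenet.tsv, assuming the Matterport3D
# Simulator's precomputed-feature layout (field names are an assumption).
import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)  # rows are large due to the encoded features
FIELDS = ["scanId", "viewpointId", "image_w", "image_h", "vfov", "features"]

features = {}
with open("img_features/ResNet-152-imagenet.tsv", "rt") as f:
    reader = csv.DictReader(f, delimiter="\t", fieldnames=FIELDS)
    for row in reader:
        feats = np.frombuffer(base64.b64decode(row["features"]), dtype=np.float32)
        features[(row["scanId"], row["viewpointId"])] = feats.reshape(36, 2048)

print(len(features), "viewpoints loaded")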

  2. Download the R2R dataset and our sampled trajectories for data augmentation:
./tasks/R2R/data/download.sh

Training

  1. Train the speaker model:
python tasks/R2R/train_speaker.py
  2. Generate synthetic instructions from the trained speaker model as data augmentation:
# the path prefix to the speaker model (trained in Step 1 above)
export SPEAKER_PATH_PREFIX=tasks/R2R/speaker/snapshots/speaker_teacher_imagenet_mean_pooled_train_iter_20000

python tasks/R2R/data_augmentation_from_speaker.py \
    $SPEAKER_PATH_PREFIX \
    tasks/R2R/data/R2R

After this step, R2R_literal_speaker_data_augmentation_paths.json will be generated under tasks/R2R/data/. This JSON file contains synthetic instructions generated by the speaker model on sampled new trajectories in the train environment (i.e. the speaker-driven data augmentation in our paper).

Alternatively, you can directly download our precomputed speaker-driven data augmentation with ./tasks/R2R/data/download_precomputed_augmentation.sh.

  3. Train the follower model on the combination of the original and the augmented training data:
python tasks/R2R/train.py \
  --use_pretraining --pretrain_splits train literal_speaker_data_augmentation_paths

The follower will first be trained on the combination of the original train environment and the new literal_speaker_data_augmentation_paths (generated in Step 2 above) for 50000 iterations, and then fine-tuned on the original train environment for 20000 iterations. This step may take a long time. (It took approximately 50 hours using a single GPU on our local machine.)

Note

  • All the commands above run on a single GPU. You may choose a specific GPU by setting the CUDA_VISIBLE_DEVICES environment variable (e.g. export CUDA_VISIBLE_DEVICES=1 to use GPU 1).
  • You may directly download our trained speaker model and follower model with
./tasks/R2R/snapshots/release/download_speaker_release.sh  # Download speaker
./tasks/R2R/snapshots/release/download_follower_release.sh  # Download follower

The scripts above will save the downloaded models under ./tasks/R2R/snapshots/release/. To use these downloaded models, set the speaker and follower path prefixes as follows:

export SPEAKER_PATH_PREFIX=tasks/R2R/snapshots/release/speaker_final_release
export FOLLOWER_PATH_PREFIX=tasks/R2R/snapshots/release/follower_final_release
  • One can also train the follower only on the original training data without using the augmented data from the speaker as follows:
python tasks/R2R/train.py

Inference

  1. Set the path prefixes for the trained speaker and follower model:
# the path prefixes to the trained speaker and follower model
# change these path prefixes if you are using downloaded models.
export SPEAKER_PATH_PREFIX=tasks/R2R/speaker/snapshots/speaker_teacher_imagenet_mean_pooled_train_iter_20000
export FOLLOWER_PATH_PREFIX=tasks/R2R/snapshots/follower_with_pretraining_sample_imagenet_mean_pooled_train_iter_11100
  2. Generate top-ranking trajectory predictions with pragmatic inference:
# Specify the path prefix to the output evaluation file
export EVAL_FILE_PREFIX=tasks/R2R/eval_outputs/pragmatics

python tasks/R2R/rational_follower.py \
    $FOLLOWER_PATH_PREFIX $SPEAKER_PATH_PREFIX \
    --batch_size 15 --beam_size 40 --state_factored_search \
    --use_test_set \
    --eval_file $EVAL_FILE_PREFIX

This will generate the prediction files in the directory of EVAL_FILE_PREFIX, and also print the performance on val_seen and val_unseen splits. (The displayed performance will be zero on the test split, since the test JSON file does not contain ground-truth target locations.) The trajectories predicted by the above script contain only the top-scoring trajectory for each instruction among all candidates, ranked with pragmatic inference. The expected success rates are 70.1% and 54.6% on val_seen and val_unseen, respectively.
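
Conceptually, the pragmatic reranking step scores each candidate route with both models and keeps the best one. The sketch below is only an illustration of that idea, not the repository's implementation; the key names and the weight value are hypothetical.

# Illustrative sketch of pragmatic reranking (hypothetical key names and weight):
# each candidate route for an instruction is rescored with a weighted combination
# of speaker and follower log-probabilities, and the top-scoring route is kept.
def rerank_candidates(candidates, speaker_weight=0.95):
    def pragmatic_score(c):
        return (speaker_weight * c["speaker_logprob"]
                + (1.0 - speaker_weight) * c["follower_logprob"])
    return max(candidates, key=pragmatic_score)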

  3. To participate in the Vision-and-Language Navigation Challenge, add the --physical_traversal option to generate physically plausible trajectory predictions with pragmatic inference:
# Specify the path prefix to the output evaluation file
export EVAL_FILE_PREFIX=tasks/R2R/eval_outputs/pragmatics_physical

python tasks/R2R/rational_follower.py \
    $FOLLOWER_PATH_PREFIX $SPEAKER_PATH_PREFIX \
    --batch_size 15 --beam_size 40 --state_factored_search \
    --use_test_set --physical_traversal \
    --eval_file $EVAL_FILE_PREFIX

This will generate the prediction files in the directory of EVAL_FILE_PREFIX. These prediction files can be submitted to https://evalai.cloudcv.org/web/challenges/challenge-page/97/overview for evaluation. The expected success rate on the challenge test set is 53.5%.

The major difference with --physical_traversal is that the generated trajectories contain all states visited by the search algorithm, in the order they are traversed. The agent expands each route forward one step at a time, and then switches to expanding the next route. The details are explained in Appendix E of our paper.
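
As a rough illustration of that expansion order (a simplification, not the repository's search code), the following sketch advances each active route by one step before moving on to the next, recording every visited state in traversal order; expand_one_step is a hypothetical helper.

# Simplified sketch of interleaved route expansion (not the repo's search code).
# expand_one_step(route) is a hypothetical helper that advances a partial route
# by one step and returns the newly visited state, or None when the route is done.
def physical_traversal(routes, expand_one_step):
    traversal = []
    active = list(routes)
    while active:
        still_active = []
        for route in active:
            state = expand_one_step(route)
            if state is not None:
                traversal.append(state)  # record states in the order they are visited
                still_active.append(route)
        active = still_active
    return traversal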

  4. It is also possible to evaluate the performance of the follower alone, using greedy decoding (without pragmatic inference from the speaker):
export EVAL_FILE_PREFIX=tasks/R2R/eval_outputs/greedy

python tasks/R2R/validate.py \
    $FOLLOWER_PATH_PREFIX \
    --batch_size 100 \
    --use_test_set \
    --eval_file $EVAL_FILE_PREFIX

This will generate the prediction files in the directory of EVAL_FILE_PREFIX, and also print the performance on val_seen and val_unseen splits. (The displayed performance will be zero on the test split, since the test JSON file does not contain ground-truth target locations.) The expected success rates are 66.4% and 35.5% on val_seen and val_unseen, respectively.
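
For reference, success on R2R is typically defined as the agent stopping within 3 meters of the goal. The sketch below illustrates the metric only (the repository's own evaluation code is authoritative); goal_distance is a hypothetical helper returning shortest-path distance in meters.

# Illustration of the R2R success metric (not the repo's evaluation code).
# goal_distance(scan, a, b) is a hypothetical helper returning the shortest-path
# distance in meters between two viewpoints in the environment graph.
def success_rate(results, goal_distance, threshold=3.0):
    results = list(results)  # (scan, final_viewpoint, goal_viewpoint) triples
    successes = sum(1 for scan, final, goal in results
                    if goal_distance(scan, final, goal) < threshold)
    return successes / max(len(results), 1)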

Acknowledgements

This repository is built upon the Matterport3DSimulator codebase.
