Self-Monitoring Navigation Agent for Vision-and-Language Navigation
This is the PyTorch implementation of our paper:
Self-Monitoring Navigation Agent via Auxiliary Progress Estimation
Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira,
Richard Socher, Caiming Xiong
International Conference on Learning Representations (ICLR), 2019
(Top 7% of reviews)
[arXiv] [GitHub] [Project] [OpenReview]
Follow-up work at CVPR 2019 (Oral)
Our follow-up work has been accepted at CVPR 2019 (Oral). Please check out here:
The Regretful Agent: Heuristic-Aided Navigation through Progress Estimation
[arXiv] [GitHub] [Project]
Abstract
The Vision-and-Language Navigation (VLN) task entails an agent following navigational instruction in photo-realistic unknown environments. This challenging task demands that the agent be aware of which instruction was completed, which instruction is needed next, which way to go, and its navigation progress towards the goal. In this paper, we introduce a self-monitoring agent with two complementary components: (1) visual-textual co-grounding module to locate the instruction completed in the past, the instruction required for the next action, and the next moving direction from surrounding images and (2) progress monitor to ensure the grounded instruction correctly reflects the navigation progress. We test our self-monitoring agent on a standard benchmark and analyze our proposed approach through a series of ablation studies that elucidate the contributions of the primary components. Using our proposed method, we set the new state of the art by a significant margin (8% absolute increase in success rate on the unseen test set).
Installation / Build Instructions
Prerequisites
A C++ compiler with C++11 support is required. Matterport3D Simulator has several dependencies:
- Ubuntu 14.04, 16.04, 18.04
- OpenCV >= 2.4 including 3.x
- OpenGL
- OSMesa
- GLM
- Numpy
- pybind11 for Python bindings
- Doxygen for building documentation
E.g. installing dependencies on Ubuntu:
sudo apt-get install libopencv-dev python-opencv freeglut3 freeglut3-dev libglm-dev libjsoncpp-dev doxygen libosmesa6-dev libosmesa6 libglew-dev
For installing dependencies on MacOS, please refer to Installation on MacOS.
Clone Repo
Clone the Self-Monitoring Agent repository:
# Make sure to clone with --recursive
git clone --recursive https://github.com/chihyaoma/selfmonitoring-agent.git
cd selfmonitoring-agent
If you didn't clone with the --recursive flag, then you'll need to manually clone the pybind submodule from the top-level directory:
git submodule update --init --recursive
Note that our repository is based on the Matterport3DSimulator, which was originally proposed with the Room-to-Room dataset.
Directory Structure
- connectivity: JSON navigation graphs.
- img_features: Storage for precomputed image features.
- data: You create a symlink to the Matterport3D dataset.
- tasks: Currently just the Room-to-Room (R2R) navigation task.
Other directories are mostly self-explanatory.
Dataset Download
Matterport3DSimulator comes with both RGB images and precomputed ResNet image features. To replicate our model performance, you will ONLY need the precomputed image features. You will, however, need the RGB images if you plan to visualize how the agent runs in the virtual environments.
RGB images from Matterport3D Dataset
Download the Matterport3D dataset, which is available after requesting access here. The provided download script allows for downloading of selected data types. Note that for the Matterport3D Simulator, only the following data types are required (and can be selected with the download script):
matterport_skybox_images
Create a symlink to the Matterport3D Dataset, which should be structured as <Matterdata>/v1/scans/<scanId>/matterport_skybox_images/*.jpg:
ln -s <Matterdata> data
Using symlinks will allow the same Matterport3D dataset installation to be used between multiple projects.
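To confirm that the symlink points at a dataset with the layout described above, a small check along the following lines may help. This is only a sketch: the function name scans_missing_skybox is hypothetical and not part of this repository, and it treats a scan as complete as soon as its matterport_skybox_images directory contains at least one .jpg file.

```python
from pathlib import Path


def scans_missing_skybox(data_root="data"):
    """Return the scan IDs under <data_root>/v1/scans that have no skybox JPEGs.

    Hypothetical helper for illustration; it only checks the directory layout
    <data_root>/v1/scans/<scanId>/matterport_skybox_images/*.jpg.
    """
    scans_dir = Path(data_root) / "v1" / "scans"
    if not scans_dir.is_dir():
        raise FileNotFoundError(f"expected directory not found: {scans_dir}")

    missing = []
    for scan in sorted(p for p in scans_dir.iterdir() if p.is_dir()):
        skybox = scan / "matterport_skybox_images"
        # A scan counts as present if the folder exists and holds any .jpg file.
        has_images = skybox.is_dir() and any(skybox.glob("*.jpg"))
        if not has_images:
            missing.append(scan.name)
    return missing
```

Calling scans_missing_skybox("data") from the top-level directory returns an empty list when every scan folder contains skybox images, and raises FileNotFoundError when the symlink or the v1/scans directory is absent.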
Precomputing ResNet Image Features from Matterport3DSimulator
To speed up model training times, it is convenient to discretize heading and elevation into 30 degree increments, and to precompute image features for each view.
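The discretization described above can be sketched as follows. The 30 degree step comes from the text; the choice of three elevation levels (-30, 0 and +30 degrees), which gives 36 views per viewpoint, is an assumption made for illustration, and the names used here are hypothetical.

```python
import math

HEADING_STEP_DEG = 30           # from the text: 30 degree increments
ELEVATIONS_DEG = (-30, 0, 30)   # assumption: three elevation levels


def discretized_views():
    """List (heading, elevation) pairs in radians on a 30 degree grid."""
    views = []
    for elevation in ELEVATIONS_DEG:
        for heading in range(0, 360, HEADING_STEP_DEG):
            views.append((math.radians(heading), math.radians(elevation)))
    return views
```

Under these assumptions, len(discretized_views()) is 36 (12 headings for each of 3 elevations), so one image feature would be precomputed for each of these views at every viewpoint.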
We use the original precomputed image features from Matterport3DSimulator, which provides image features from models pretrained on ImageNet and Places365.
Download and extract the tsv files into the img_features directory. You will only need the ImageNet features to replicate our results. We will unpack the zip file later.
Empirically, we found that a model using features from Places365 performs similarly to the model using ImageNet features.
Installation for R2R with PyTorch
Now that you have cloned the repo and downloaded the image features needed, let us unpack the features and get things ready to run experiments.
Create Anaconda environment
# change "r2r" to any name you prefer, e.g., r2r-pytorch
conda create -n r2r python=3.6
Activate the environment you just created
source activate r2r
Install special requirements for the R2R dataset
pip install -r tasks/R2R-pano/requirements.txt
Install PyTorch for your Conda Env
Check the official PyTorch website for other CUDA versions.
# with CUDA 10
conda install pytorch torchvision cuda100 -c pytorch
# MacOS without GPU
conda install pytorch torchvision -c pytorch
Download R2R dataset
Download the original data from MatterPort3DSimulator and the synthetic data for data augmentation proposed by Speaker-Follower in NeurIPS 2018.
# download dataset
./tasks/R2R-pano/data/download.sh
# download the synthetic data from Speaker-Follower
./tasks/R2R-pano/data/download_precomputed_augmentation.sh
# if you haven't already downloaded the precomputed image features; otherwise skip this step
cd img_features
wget -O ResNet-152-imagenet.zip https://www.dropbox.com/s/o57kxh2mn5rkx4o/ResNet-152-imagenet.zip\?dl\=1
# unzip the file
unzip ResNet-152-imagenet.zip
cd ..
Compile the Matterport3D Simulator
Let us compile the simulator so that we can call its functions in Python.
Build OpenGL version using CMake:
mkdir build && cd build
cmake ..
# Double-check that CMake found the proper path to your Python
# if not, remove the generated build files and rerun cmake with the option below instead
rm -rf *
cmake -DPYTHON_EXECUTABLE:FILEPATH=/path/to/your/bin/python ..
make
cd ../
Or build headless OSMESA version using CMake:
mkdir build && cd build
cmake -DOSMESA_RENDERING=ON ..
make
cd ../
Running Tests on simulator
Now that the compilation is completed, let us make sure the simulator was installed successfully and runs smoothly.
build/tests
Or, if you haven't installed the Matterport3D dataset, you will need to skip the rendering tests:
build/tests exclude:[Rendering]
Refer to the Catch documentation for additional usage and configuration options.
Run a minimal test to check that the code can successfully start training:
python tasks/R2R-pano/main.py
Congratulations! You have completed the installation for the simulator and R2R dataset. You are now ready for training and reproducing our results.
Training and reproducing results
Train on real data
To replicate the performance reported in our paper, train the proposed self-monitoring agent with:
# co-grounding + self-monitoring
CUDA_VISIBLE_DEVICES=0 python tasks/R2R-pano/main.py \
--exp_name 'cogrounding-selfmonitoring-agent' \
--batch_size 64 \
--img_fc_use_angle 1 \
--img_feat_input_dim 2176 \
--img_fc_dim 1024 \
--rnn_hidden_size 512 \
--eval_every_epochs 5 \
--use_ignore_index 1 \
--arch 'self-monitoring' \
--value_loss_weight 0.5 \
--monitor_sigmoid 0 \
--mse_sum 0 \
--fix_action_ended 0
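The pair --img_fc_use_angle 1 and --img_feat_input_dim 2176 suggests that each view's input is an image feature concatenated with an encoding of the view's orientation. The sketch below shows one way such a 2176-dimensional input could be assembled. It assumes a 2048-dimensional ResNet-152 feature plus a 128-dimensional angle feature built by tiling [sin(heading), cos(heading), sin(elevation), cos(elevation)] 32 times. The tiling scheme and all names here are assumptions for illustration, not taken from this repository's code.

```python
import math

IMG_FEAT_DIM = 2048     # size of a pooled ResNet-152 feature
ANGLE_FEAT_DIM = 128    # assumption: 2176 - 2048


def angle_feature(heading, elevation, dim=ANGLE_FEAT_DIM):
    """Tile [sin h, cos h, sin e, cos e] until the vector has `dim` entries."""
    base = [math.sin(heading), math.cos(heading),
            math.sin(elevation), math.cos(elevation)]
    if dim % len(base) != 0:
        raise ValueError("dim must be a multiple of 4")
    return base * (dim // len(base))


def view_feature(img_feat, heading, elevation):
    """Concatenate an image feature with the orientation encoding."""
    if len(img_feat) != IMG_FEAT_DIM:
        raise ValueError(f"expected {IMG_FEAT_DIM} image features")
    return list(img_feat) + angle_feature(heading, elevation)
```

With these assumed sizes, view_feature returns a vector of length 2048 + 128 = 2176, which matches the value passed to --img_feat_input_dim.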
Train on synthetic data & finetune on real data
Pre-train on synthetic data
# co-grounding + self-monitoring + pre-train on synthetic data
CUDA_VISIBLE_DEVICES=0 python tasks/R2R-pano/main.py \
--exp_name 'cogrounding-selfmonitoring-agent' \
--batch_size 64 \
--img_fc_use_angle 1 \
--img_feat_input_dim 2176 \
--img_fc_dim 1024 \
--rnn_hidden_size 512 \
--eval_every_epochs 5 \
--use_ignore_index 1 \
--arch 'self-monitoring' \
--value_loss_weight 0.5 \
--monitor_sigmoid 0 \
--mse_sum 0 \
--fix_action_ended 0 \
--train_data_augmentation 1 \
--epochs_data_augmentation 300 # pre-train for 300 epochs
Once the training on synthetic data is completed, we can now train on real data.
# co-grounding + self-monitoring + finetune on real data
#   --resume 'best'          resume from the best performing pre-trained model
#   --max_num_epochs 500     fine-tune until maximum 500 epochs
CUDA_VISIBLE_DEVICES=0 python tasks/R2R-pano/main.py \
--exp_name 'cogrounding-selfmonitoring-agent' \
--batch_size 64 \
--img_fc_use_angle 1 \
--img_feat_input_dim 2176 \
--img_fc_dim 1024 \
--rnn_hidden_size 512 \
--eval_every_epochs 5 \
--use_ignore_index 1 \
--arch 'self-monitoring' \
--value_loss_weight 0.5 \
--monitor_sigmoid 0 \
--mse_sum 0 \
--fix_action_ended 0 \
--resume 'best' \
--max_num_epochs 500 \
--exp_name_secondary '_resume|best'
You can check the training process using TensorBoard.
cd tensorboard_logs/
tensorboard --logdir=pano-seq2seq
Inference
Greedy decoding
The default inference mode is greedy decoding.
The resulting performance should match the numbers reported in our ablation study table.
Beam search
For a fair comparison with Speaker-Follower, we adopt beam search with the proposed self-monitoring agent.
Once training is completed, we follow the naming convention used above to resume the best performing model and use beam search during inference:
#   --resume 'best'    resume from best performing model
#   --eval_beam 1      use beam search for evaluation
#   --beam_size 15     set beam size to 15
CUDA_VISIBLE_DEVICES=0 python tasks/R2R-pano/main.py \
--exp_name 'cogrounding-selfmonitoring-agent' \
--batch_size 64 \
--img_fc_use_angle 1 \
--img_feat_input_dim 2176 \
--img_fc_dim 1024 \
--rnn_hidden_size 512 \
--eval_every_epochs 5 \
--use_ignore_index 1 \
--arch 'self-monitoring' \
--value_loss_weight 0.5 \
--monitor_sigmoid 0 \
--mse_sum 0 \
--fix_action_ended 0 \
--resume 'best' \
--eval_beam 1 \
--beam_size 15
Progress inference
If the progress monitor output decreases, the agent is required to move back to the previous viewpoint and select the action with next highest probability. We repeat this process until the selected action leads to increasing progress monitor output.
For convenience, we implement this idea by taking advantage of mini-batch processing. We precompute the progress monitor output for a number of navigable directions, but select the direction only according to the order of the action probabilities for these directions. We make sure the agent does not peek into a direction that it has not yet visited.
#   --resume 'best'           resume from best performing model
#   --progress_inference 1    use progress inference for evaluation
#   --beam_size 5             this precomputes the progress monitor for 5 navigable directions
CUDA_VISIBLE_DEVICES=0 python tasks/R2R-pano/main.py \
--exp_name 'cogrounding-selfmonitoring-agent' \
--batch_size 64 \
--img_fc_use_angle 1 \
--img_feat_input_dim 2176 \
--img_fc_dim 1024 \
--rnn_hidden_size 512 \
--eval_every_epochs 5 \
--use_ignore_index 1 \
--arch 'self-monitoring' \
--value_loss_weight 0.5 \
--monitor_sigmoid 0 \
--mse_sum 0 \
--fix_action_ended 0 \
--resume 'best' \
--eval_only 1 \
--progress_inference 1 \
--beam_size 5
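The selection rule behind progress inference can be sketched in a few lines of plain Python. This is a simplified illustration for a single agent, not the batched implementation in this repository: the function name, its arguments, and the fallback to the most probable action when no candidate increases the progress estimate are all assumptions.

```python
def select_with_progress_inference(action_probs, progress_estimates,
                                   current_progress, top_k=5):
    """Return the index of the direction to take.

    action_probs:       probability of each navigable direction
    progress_estimates: progress monitor output after moving in each direction
    current_progress:   progress monitor output at the current viewpoint
    top_k:              number of directions considered (cf. --beam_size)
    """
    if not action_probs or len(action_probs) != len(progress_estimates):
        raise ValueError("inputs must be non-empty and of equal length")

    # Visit candidates strictly in order of decreasing action probability.
    ranked = sorted(range(len(action_probs)),
                    key=lambda i: action_probs[i], reverse=True)
    candidates = ranked[:top_k]

    for idx in candidates:
        # Accept the first candidate whose progress estimate increases.
        if progress_estimates[idx] > current_progress:
            return idx

    # Assumption: if nothing increases progress, keep the most probable action.
    return candidates[0]
```

For example, with action probabilities [0.5, 0.3, 0.2], progress estimates [0.1, 0.4, 0.9] and a current progress of 0.3, the sketch skips the most probable direction (its estimate drops to 0.1) and returns index 1, the next most probable direction whose estimate increases.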
Reproducibility
Note that our results were originally produced using PyTorch 0.4 on a Titan Xp GPU. You may get slightly different results due to using different PyTorch versions, different GPUs, or different hyper-parameters. The overall performance should be fairly robust even with different random seeds. Please open an issue or contact Chih-Yao Ma if you cannot reproduce the results.
Acknowledgments
This research was partially supported by DARPA's Lifelong Learning Machines (L2M) program, under Cooperative Agreement HR0011-18-2-001. We thank the authors of Speaker-Follower, Ronghang Hu and Daniel Fried, for communicating with us and providing details of the implementation and synthetic instructions for fair comparison.
Citation
If you find this repository useful, please cite our paper:
@inproceedings{ma2019selfmonitoring,
title={Self-Monitoring Navigation Agent via Auxiliary Progress Estimation},
author={Chih-Yao Ma and Jiasen Lu and Zuxuan Wu and Ghassan AlRegib and Zsolt Kira and Richard Socher and Caiming Xiong},
booktitle={Proceedings of the International Conference on Learning Representations (ICLR)},
year={2019},
url={https://arxiv.org/abs/1901.03035},
}