DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion
arXiv - CVF
Paper (CVPR 2021):YouTube
Presentation (5 min.):DeepVideoMVS is a learning-based online multi-view depth prediction approach on posed video streams, where the scene geometry information computed in the previous time steps is propagated to the current time step. The backbone of the approach is a real-time capable, lightweight encoder-decoder that relies on cost volumes computed from pairs of images. We extend it with a ConvLSTM cell at the bottleneck layer, and a hidden state propagation scheme where we partially account for the viewpoint changes between time steps.
This extension brings only a small overhead of computation time and memory consumption over the backbone, while improving the depth predictions significantly. As the result, DeepVideoMVS achieves highly accurate depth maps with real-time performance and low memory consumption. It produces noticeably more consistent depth predictions than our backbone and the existing methods throughout a sequence, which gets reflected as less noisy reconstructions.
dvmvs-fusionnet-in-scannet-756.mp4
dvmvs-fusionnet-vs-pairnet-in-7scenes-chess.mp4
Citation
If you find this project useful for your research, please cite:
@inproceedings{Duzceker_2021_CVPR,
author = {Duzceker, Arda and Galliani, Silvano and Vogel, Christoph and
Speciale, Pablo and Dusmanu, Mihai and Pollefeys, Marc},
title = {DeepVideoMVS: Multi-View Stereo on Video With Recurrent Spatio-Temporal Fusion},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2021},
pages = {15324-15333}
}
Dependencies / Installation
conda create -n dvmvs-env
conda activate dvmvs-env
conda install -y -c conda-forge -c pytorch -c fvcore -c pytorch3d \
python=3.8 \
pytorch=1.5.1 \
torchvision=0.6.1 \
cudatoolkit=10.2 \
opencv=4.4.0 \
tqdm=4.50.2 \
scipy=1.5.2 \
fvcore=0.1.2 \
pytorch3d=0.2.5
pip install \
pytools==2020.4 \
kornia==0.3.2 \
path==15.0.0 \
protobuf==3.13.0 \
tensorboardx==2.1
git clone https://github.com/ardaduz/deep-video-mvs.git
pip install -e deep-video-mvs
Data Structure
The scripts for parsing the datasets are provided in the dataset
folder.
All of the scripts might not work straight ahead due to naming and foldering conventions while
downloading the datasets, however they should help reduce the effort required. Exporting ScanNet .sens
files, both for training and testing, should work with very minimal effort. The script that is provided here is
a modified version of the official code and similarly requires python2.
During testing, the system expects a data structure for a particular scene as provided in the
sample-data/hololens-dataset/000
. We assume PNG format for all images.
images
folder contains the input images that will be used by the model and the naming convention is not important. The system considers the sequential order alphabetically.depth
folder contains the groundtruth depth maps that are used for metric evaluation, the names must match with the color images. The depth images must be uint16 PNG format, and the depth value in millimeters. For example, if the depth is 1.12 meters for a pixel location, it should read 1120 in the groundtruth depth image.poses.txt
contains CAMERA-TO-WORLD pose corresponding to each color and depth image. Each line is one flattened pose in homogeneous coordinates.K.txt
is the intrinsic matrix for a given sequence after the images are undistorted.
During training, the system expects each scene to be placed in a folder, and color image and depth image for a time step to be packed inside a zipped numpy archive (.npz). See the code here. We use frame_skip=4 while exporting the ScanNet training and validation scenes due to the large amount of data. The training/validation split of unique scenes which are used during this work is also provided here, one may replace the randomly generated ones with these two.
Training and Testing:
-
The pre-trained weights are provided. They are placed here and automatically loaded during testing.
-
There are no command line arguments for the system. Instead, many general parameters are controlled from the
config.py
within the classConfig
. -
Please adjust the input and output folder locations (and/or other settings) inside the
config.py
.
Training:
In addition to the general Config
, very specific training hyperparameters
like subsequence length, learning rate, etc. are controlled directly inside
the training scripts from TrainingHyperparameters
class.
To train the networks from scratch, please refer to the detailed explanation of the procedure that we follow provided in the supplementary of the paper. In summary, we first train the pairnet independently and use some modules' weights to partially initialize our fusionnet. For fusionnet, we start by training the cell and the decoder, which are randomly initialized, and then gradually unfreeze the other modules. Finally, we finetune only the cell while warping the hidden states with the predictions instead of the groundtruth depths.
- pairnet training script:
cd deep-video-mvs/dvmvs/pairnet python run-training.py
- fusionnet training script:
cd deep-video-mvs/dvmvs/fusionnet python run-training.py
Testing:
We provide two scripts for running the inference:
1. Bulk Testing
First is run-testing.py
for evaluating on multiple datasets and/or sequences
at a run. This script requires pre-selected keyframe filenames for the desired sequences,
similar to the ones provided in the sample-data/indices
. In
a keyframe file,
each row represents a timestep, the entry in the first column represents the reference frame, and the entries in
the second, third, ... columns represent the measurement frames used for the cost volume computation.
One can determine the keyframe filenames with custom keyframe selection approaches,
or we provide the simulation of our keyframe selection heuristic in
simulate_keyframe_buffer.py
. The predictions and errors of bulk testing
are saved to the Config.test_result_folder
.
2. Single Scene Online Testing
Second is run-testing-online.py
to run the testing in an online fashion.
One can specify a single scene in Config.test_online_scene_path
, then run the online inference
to evaluate on the specified scene.
In this script, we use our keyframe selection heuristic on-the-go and predict the depth maps
for the selected keyframes (Attention! We do not predict depth maps for all images).
The predictions and errors of single scene online testing are saved to the working directory.
To run the online testing:
cd deep-video-mvs/dvmvs/fusionnet
python run-testing-online.py
Predicted depth maps for a scene and
average error of each frame are saved in .npz format.
Errors contain 8 different metrics for each frame in order:
abs_error
, abs_relative_error
, abs_inverse_error
, squared_relative_error
,
rmse
, ratio_125
, ratio_125_2
, ratio_125_3
. They can be accessed with:
predictions = numpy.load(prediction_filename)['arr_0']
errors = numpy.load(error_filename)['arr_0']
Comparison with the Existing Methods:
In this work, our method is compared with DELTAS,
GP-MVS, DPSNet,
MVDepthNet and Neural RGBD.
For ease of evaluation, we slightly modified the inference codes of the first four methods to make them
compatible with the data structure and the keyframe selection files. For Neural RGBD, in contrast, we adjusted
the data structure and used the original code. The modified inference codes (and the finetuned weights, if necessary)
are provided in the dvmvs/baselines
directory. Please refer to the paper for the comparison results.
TSDF Reconstructions:
TSDF reconstructions demonstrated in the paper and in the videos are acquired with the implementation from
https://github.com/andyzeng/tsdf-fusion-python. A modified version of this code is provided as
sample-data/run-tsdf-reconstruction.py
. Same with the original implementation, additional packages are
required to run the script, and can be installed to the existing environment with:
conda activate dvmvs-env
conda install -c conda-forge numba scikit-image pycuda
We strongly recommend CUDA Toolkit (nvcc is required) and pycuda installation to get reasonable runtimes.
Default arguments for sample-data/run-tsdf-reconstruction.py
are readily set.
In addition to the input/output locations, the reconstruction resolution (--voxel_size
) and the maximum
depth value in a depth map to be backprojected and fused (--max_depth
) can be controlled.
There are three additional flags (--use_groundtruth_to_anchor
, --save_progressive
, --save_groundtruth
),
please refer to the script or use the --help
flag for their functionalities.
For convenience, example predictions from the sample scene are also provided in the sample-data/predictions
folder.
Finally, a couple of low resolution 3D reconstruction results are given in the sample-data/reconstructions
folder.