# Monocular Visual-Inertial Depth Estimation
This repository contains code and models for our paper:
Monocular Visual-Inertial Depth Estimation
Diana Wofk, René Ranftl, Matthias Müller, Vladlen Koltun
For a quick overview of the work you can watch the short talk and teaser on YouTube.
## Introduction
We present a visual-inertial depth estimation pipeline that integrates monocular depth estimation and visual-inertial odometry (VIO) to produce dense depth estimates with metric scale. Our approach consists of three stages: (1) input processing, where RGB and IMU data feed into monocular depth estimation alongside visual-inertial odometry; (2) global scale and shift alignment, where monocular depth estimates are fitted to sparse depth from VIO in a least-squares manner; and (3) learning-based dense scale alignment, where globally-aligned depth is locally realigned using a dense scale map regressed by the ScaleMapLearner (SML). The images at the bottom of the diagram above illustrate a VOID sample being processed through our pipeline; from left to right: the input RGB, ground truth depth, sparse depth from VIO, globally-aligned depth, scale map scaffolding, dense scale map regressed by SML, and the final depth output.
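The global alignment in stage (2) has a simple closed-form solution. Below is a minimal sketch of that step (not the repository's implementation): it assumes `pred` is the monocular prediction and `sparse` is the sparse metric reference from VIO, both as same-shaped NumPy arrays expressed in the same (e.g., inverse-depth) domain, with zeros marking pixels that have no sparse value; the hypothetical helper `global_align` fits a single scale and shift by least squares and applies them to the whole prediction.

```python
import numpy as np

def global_align(pred, sparse):
    """Fit scale s and shift t minimizing || s * pred + t - sparse ||^2
    over pixels with a sparse measurement, then apply them globally.
    Illustrative sketch of the least-squares alignment described above."""
    mask = sparse > 0                          # pixels with sparse depth from VIO
    x = pred[mask].astype(np.float64)
    y = sparse[mask].astype(np.float64)
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * pred + t                        # globally-aligned estimate
```

Stage (3) then replaces this single global scale with a dense, per-pixel scale map regressed by SML.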
## Setup
- Setup dependencies:

  ```bash
  conda env create -f environment.yaml
  conda activate vi-depth
  ```
- Pick one or more ScaleMapLearner (SML) models and download the corresponding weights to the `weights` folder.

  | Depth Predictor  | SML on VOID 150 | SML on VOID 500 | SML on VOID 1500 |
  |------------------|-----------------|-----------------|------------------|
  | DPT-BEiT-Large   | model           | model           | model            |
  | DPT-SwinV2-Large | model           | model           | model            |
  | DPT-Large        | model           | model           | model            |
  | DPT-Hybrid       | model*          | model           | model            |
  | DPT-SwinV2-Tiny  | model           | model           | model            |
  | DPT-LeViT        | model           | model           | model            |
  | MiDaS-small      | model           | model           | model            |

  *Also available with pretraining on TartanAir: model
## Inference
- Place inputs into the `input` folder. An input image and corresponding sparse metric depth map are expected:

  ```
  input
  ├── image                # RGB image
  │   ├── <timestamp>.png
  │   └── ...
  └── sparse_depth         # sparse metric depth map
      ├── <timestamp>.png  # as 16b PNG
      └── ...
  ```
  The `load_sparse_depth` function in `run.py` may need to be modified depending on the format in which sparse depth is stored. By default, the depth storage method used in the VOID dataset is assumed (see the sketch after this list).
- Run the `run.py` script as follows:

  ```bash
  DEPTH_PREDICTOR="dpt_beit_large_512"
  NSAMPLES=150
  SML_MODEL_PATH="weights/sml_model.dpredictor.${DEPTH_PREDICTOR}.nsamples.${NSAMPLES}.ckpt"

  python run.py -dp $DEPTH_PREDICTOR -ns $NSAMPLES -sm $SML_MODEL_PATH --save-output
  ```
- The `--save-output` flag enables saving outputs to the `output` folder. By default, the following outputs will be saved per sample:

  ```
  output
  ├── ga_depth             # metric depth map after global alignment
  │   ├── <timestamp>.pfm  # as PFM
  │   ├── <timestamp>.png  # as 16b PNG
  │   └── ...
  └── sml_depth            # metric depth map output by SML
      ├── <timestamp>.pfm  # as PFM
      ├── <timestamp>.png  # as 16b PNG
      └── ...
  ```
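For reference, here is a minimal sketch of reading and writing metric depth as 16-bit PNGs under the VOID convention (depth in meters multiplied by 256 before storing). The scale factor and helpers below are assumptions for illustration, not the repository's exact I/O code; adapt `load_sparse_depth` in `run.py` if your sparse depth is stored differently.

```python
import numpy as np
from PIL import Image

DEPTH_SCALE = 256.0  # assumed VOID convention: stored value / 256 = depth in meters

def read_depth_png(path):
    """Read a 16-bit depth PNG and convert it to meters (0 = no measurement)."""
    return np.asarray(Image.open(path), dtype=np.float32) / DEPTH_SCALE

def write_depth_png(path, depth_m):
    """Write a metric depth map (in meters) as a 16-bit depth PNG."""
    stored = np.clip(depth_m * DEPTH_SCALE, 0, 2**16 - 1).astype(np.uint16)
    Image.fromarray(stored).save(path)
```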
## Evaluation
Models provided in this repo were trained on the VOID dataset.
- Download the VOID dataset following the instructions in the VOID dataset repo.
- To evaluate on VOID test sets, run the `evaluate.py` script as follows:

  ```bash
  DATASET_PATH="/path/to/void_release/"
  DEPTH_PREDICTOR="dpt_beit_large_512"
  NSAMPLES=150
  SML_MODEL_PATH="weights/sml_model.dpredictor.${DEPTH_PREDICTOR}.nsamples.${NSAMPLES}.ckpt"

  python evaluate.py -ds $DATASET_PATH -dp $DEPTH_PREDICTOR -ns $NSAMPLES -sm $SML_MODEL_PATH
  ```

  Results for the example shown above:

  ```
  Averaging metrics for globally-aligned depth over 800 samples
  Averaging metrics for SML-aligned depth over 800 samples

  +---------+----------+----------+
  | metric  | GA Only  |  GA+SML  |
  +---------+----------+----------+
  | RMSE    |  191.36  |  142.85  |
  | MAE     |  115.84  |   76.95  |
  | AbsRel  |   0.069  |   0.046  |
  | iRMSE   |   72.70  |   57.13  |
  | iMAE    |   49.32  |   34.25  |
  | iAbsRel |   0.071  |   0.048  |
  +---------+----------+----------+
  ```
To evaluate on VOID test sets at different densities (void_150, void_500, void_1500), change the `NSAMPLES` argument above accordingly.
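For orientation, the sketch below shows how the metrics above are commonly defined, evaluated only where ground truth is valid. It assumes depth errors are reported in millimeters and inverse-depth ("i") errors in 1/km, with predictions and ground truth in meters; the exact conventions used by `evaluate.py` may differ, so treat this as illustrative rather than the repository's implementation.

```python
import numpy as np

def compute_metrics(pred_m, gt_m):
    """Standard depth metrics on valid ground-truth pixels (illustrative)."""
    mask = gt_m > 0
    pred = np.maximum(pred_m[mask], 1e-8)   # avoid division by zero in inverse depth
    gt = gt_m[mask]
    err = pred - gt                          # depth error (m)
    inv_err = 1.0 / pred - 1.0 / gt          # inverse-depth error (1/m)
    return {
        "RMSE":    np.sqrt(np.mean(err ** 2)) * 1000.0,      # mm
        "MAE":     np.mean(np.abs(err)) * 1000.0,            # mm
        "AbsRel":  np.mean(np.abs(err) / gt),
        "iRMSE":   np.sqrt(np.mean(inv_err ** 2)) * 1000.0,  # 1/km
        "iMAE":    np.mean(np.abs(inv_err)) * 1000.0,        # 1/km
        "iAbsRel": np.mean(np.abs(inv_err) * gt),            # relative inverse-depth error
    }
```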
## Citation
If you reference our work, please consider citing the following:
```bibtex
@inproceedings{wofk2023videpth,
    author    = {{Wofk, Diana and Ranftl, Ren\'{e} and M{\"u}ller, Matthias and Koltun, Vladlen}},
    title     = {{Monocular Visual-Inertial Depth Estimation}},
    booktitle = {{IEEE International Conference on Robotics and Automation (ICRA)}},
    year      = {{2023}}
}
```
## Acknowledgements
Our work builds on and uses code from MiDaS, timm, and PyTorch Lightning. We'd like to thank the authors for making these libraries and frameworks available.