  • Stars: 4,041
  • Rank: 10,245 (Top 0.3%)
  • Language: Python
  • License: MIT License
  • Created almost 5 years ago
  • Updated 3 months ago

Repository Details

Code for robust monocular depth estimation described in "Ranftl et al., Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer, TPAMI 2022"

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer

This repository contains code to compute depth from a single image. It accompanies our paper:

Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-shot Cross-dataset Transfer
René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, Vladlen Koltun

and our preprint:

Vision Transformers for Dense Prediction
René Ranftl, Alexey Bochkovskiy, Vladlen Koltun

MiDaS was trained on up to 12 datasets (ReDWeb, DIML, Movies, MegaDepth, WSVD, TartanAir, HRWSI, ApolloScape, BlendedMVS, IRS, KITTI, NYU Depth V2) with multi-objective optimization. The original model that was trained on 5 datasets (MIX 5 in the paper) can be found here. The figure below shows an overview of the different MiDaS models; the bubble size scales with number of parameters.

Setup

  1. Pick one or more models and download the corresponding weights to the weights folder:

MiDaS 3.1

MiDaS 3.0: Legacy transformer models dpt_large_384 and dpt_hybrid_384

MiDaS 2.1: Legacy convolutional models midas_v21_384 and midas_v21_small_256

  2. Set up dependencies:

    conda env create -f environment.yaml
    conda activate midas-py310

optional

For the Next-ViT model, execute

git submodule add https://github.com/isl-org/Next-ViT midas/external/next_vit

For the OpenVINO model, install

pip install openvino

Usage

  1. Place one or more input images in the folder input.

  2. Run the model with

    python run.py --model_type <model_type> --input_path input --output_path output

    where <model_type> is chosen from dpt_beit_large_512, dpt_beit_large_384, dpt_beit_base_384, dpt_swin2_large_384, dpt_swin2_base_384, dpt_swin2_tiny_256, dpt_swin_large_384, dpt_next_vit_large_384, dpt_levit_224, dpt_large_384, dpt_hybrid_384, midas_v21_384, midas_v21_small_256, openvino_midas_v21_small_256.

  3. The resulting depth maps are written to the output folder.
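The steps above can also be scripted, for example to run several models over the same input folder. The following is a minimal sketch, not part of the repository: `build_command` and `run_models` are hypothetical helpers, and the per-model output subfolders are an assumption made here so that results from different models do not overwrite each other. Only the `run.py` flags shown in step 2 are used.

```python
import os
import subprocess
import sys

# Model types copied from the list in step 2 above.
MODEL_TYPES = (
    "dpt_beit_large_512", "dpt_beit_large_384", "dpt_beit_base_384",
    "dpt_swin2_large_384", "dpt_swin2_base_384", "dpt_swin2_tiny_256",
    "dpt_swin_large_384", "dpt_next_vit_large_384", "dpt_levit_224",
    "dpt_large_384", "dpt_hybrid_384", "midas_v21_384",
    "midas_v21_small_256", "openvino_midas_v21_small_256",
)


def build_command(model_type, input_path="input", output_path="output"):
    """Return the argument list for one run.py invocation (hypothetical helper)."""
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type!r}")
    return [
        sys.executable, "run.py",
        "--model_type", model_type,
        "--input_path", input_path,
        "--output_path", output_path,
    ]


def run_models(model_types, input_path="input", output_root="output"):
    """Run inference once per model; assumes the working directory is the repository root."""
    for model_type in model_types:
        # Assumption: one output subfolder per model, created here if missing.
        output_path = os.path.join(output_root, model_type)
        os.makedirs(output_path, exist_ok=True)
        subprocess.run(build_command(model_type, input_path, output_path), check=True)
```

Calling `run_models(["dpt_hybrid_384", "midas_v21_small_256"])` would then invoke `run.py` twice, provided the corresponding weights are already in the weights folder.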

optional

  1. By default, the inference resizes the height of input images to the size of a model to fit into the encoder. This size is given by the numbers in the model names of the accuracy table. Some models support a range of inference heights rather than a single one. Feel free to explore different heights by appending the extra command line argument --height. Unsupported height values will throw an error. Note that using this argument may decrease the model accuracy.
  2. By default, the inference keeps the aspect ratio of input images when feeding them into the encoder if this is supported by a model (all models except for Swin, Swin2, LeViT). In order to resize to a square resolution, disregarding the aspect ratio while preserving the height, use the command line argument --square.
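To make the two resizing rules concrete, here is a small sketch of the resulting encoder input size. It is an illustration under simplifying assumptions, not the repository's code: `default_height` and `encoder_input_size` are hypothetical helpers, the width is obtained by plain rounding (the actual implementation may additionally snap sizes to a multiple required by the encoder), and no check for unsupported heights is performed.

```python
import re

# Encoders that only accept square inputs, per the note above (Swin, Swin2, LeViT).
SQUARE_ONLY_TAGS = ("swin", "levit")


def default_height(model_type):
    """Read the default inference height from the trailing number of a model name."""
    match = re.search(r"_(\d+)$", model_type)
    if match is None:
        raise ValueError(f"no size suffix in model type: {model_type!r}")
    return int(match.group(1))


def encoder_input_size(image_width, image_height, model_type, height=None, square=False):
    """Return (width, height) fed to the encoder under the rules described above."""
    # --height overrides the default taken from the model name.
    target_height = height if height is not None else default_height(model_type)
    # --square, or a square-only encoder, ignores the aspect ratio but keeps the height.
    if square or any(tag in model_type for tag in SQUARE_ONLY_TAGS):
        return target_height, target_height
    # Otherwise the aspect ratio of the input image is preserved (assumption: plain rounding).
    target_width = round(image_width * target_height / image_height)
    return target_width, target_height


print(encoder_input_size(640, 480, "dpt_hybrid_384"))               # (512, 384)
print(encoder_input_size(640, 480, "dpt_hybrid_384", square=True))  # (384, 384)
```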

via Camera

If you want the input images to be grabbed from the camera and shown in a window, omit the input and output paths and choose a model type as shown above:

python run.py --model_type <model_type> --side

The argument --side is optional and causes both the input RGB image and the output depth map to be shown side-by-side for comparison.

via Docker

  1. Make sure you have installed Docker and the NVIDIA Docker runtime.

  2. Build the Docker image:

    docker build -t midas .
  3. Run inference:

    docker run --rm --gpus all -v $PWD/input:/opt/MiDaS/input -v $PWD/output:/opt/MiDaS/output -v $PWD/weights:/opt/MiDaS/weights midas

    This command passes through all of your NVIDIA GPUs to the container, mounts the input and output directories and then runs the inference.

via PyTorch Hub

The pretrained model is also available on PyTorch Hub.
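A rough sketch of loading a model through PyTorch Hub is shown below. The repository identifier `intel-isl/MiDaS`, the entry-point names (`MiDaS_small`, `DPT_Hybrid`, `DPT_Large`, `transforms`) and the transform attribute names are assumptions taken from the public PyTorch Hub listing rather than from this README, so check them against the hub page before relying on them. `load_midas` and `predict_depth` are hypothetical helper names. The first call downloads code and weights, so it needs network access.

```python
# Assumed PyTorch Hub identifiers (not stated in this README).
HUB_REPO = "intel-isl/MiDaS"
TRANSFORM_FOR_MODEL = {
    "DPT_Large": "dpt_transform",
    "DPT_Hybrid": "dpt_transform",
    "MiDaS_small": "small_transform",
}


def load_midas(model_name="MiDaS_small"):
    """Load a model and its matching input transform from PyTorch Hub."""
    import torch  # imported lazily so this module can be inspected without PyTorch

    model = torch.hub.load(HUB_REPO, model_name)
    model.eval()
    transforms = torch.hub.load(HUB_REPO, "transforms")
    return model, getattr(transforms, TRANSFORM_FOR_MODEL[model_name])


def predict_depth(model, transform, rgb_image):
    """Predict relative inverse depth for an RGB image given as an H x W x 3 array."""
    import torch

    batch = transform(rgb_image)
    with torch.no_grad():
        prediction = model(batch)
        # Resize the prediction back to the resolution of the input image.
        prediction = torch.nn.functional.interpolate(
            prediction.unsqueeze(1),
            size=rgb_image.shape[:2],
            mode="bicubic",
            align_corners=False,
        ).squeeze()
    return prediction.cpu().numpy()
```

A typical call would be `model, transform = load_midas("MiDaS_small")` followed by `depth = predict_depth(model, transform, image)`, where `image` is an RGB array, for example one read with OpenCV and converted from BGR.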

via TensorFlow or ONNX

See README in the tf subdirectory.

Currently only supports MiDaS v2.1.

via Mobile (iOS / Android)

See README in the mobile subdirectory.

via ROS1 (Robot Operating System)

See README in the ros subdirectory.

Currently only supports MiDaS v2.1. DPT-based models to be added.

Accuracy

We provide a zero-shot error $\epsilon_d$ which is evaluated for 6 different datasets (see paper). Lower error values are better. $\color{green}{\textsf{Overall model quality is represented by the improvement}}$ (Imp.) with respect to MiDaS 3.0 DPTL-384. The models are grouped by the height used for inference, whereas the square training resolution is given by the numbers in the model names. The table also shows the number of parameters (in millions) and the frames per second for inference at the training resolution (for GPU RTX 3090):

| MiDaS Model | DIW WHDR | Eth3d AbsRel | Sintel AbsRel | TUM δ1 | KITTI δ1 | NYUv2 δ1 | $\color{green}{\textsf{Imp.}}$ % | Par. M | FPS |
|---|---|---|---|---|---|---|---|---|---|
| Inference height 512 | | | | | | | | | |
| v3.1 BEiTL-512 | 0.1137 | 0.0659 | 0.2366 | 6.13 | 11.56* | 1.86* | $\color{green}{\textsf{19}}$ | 345 | 5.7 |
| v3.1 BEiTL-512$\tiny{\square}$ | 0.1121 | 0.0614 | 0.2090 | 6.46 | 5.00* | 1.90* | $\color{green}{\textsf{34}}$ | 345 | 5.7 |
| Inference height 384 | | | | | | | | | |
| v3.1 BEiTL-512 | 0.1245 | 0.0681 | 0.2176 | 6.13 | 6.28* | 2.16* | $\color{green}{\textsf{28}}$ | 345 | 12 |
| v3.1 Swin2L-384$\tiny{\square}$ | 0.1106 | 0.0732 | 0.2442 | 8.87 | 5.84* | 2.92* | $\color{green}{\textsf{22}}$ | 213 | 41 |
| v3.1 Swin2B-384$\tiny{\square}$ | 0.1095 | 0.0790 | 0.2404 | 8.93 | 5.97* | 3.28* | $\color{green}{\textsf{22}}$ | 102 | 39 |
| v3.1 SwinL-384$\tiny{\square}$ | 0.1126 | 0.0853 | 0.2428 | 8.74 | 6.60* | 3.34* | $\color{green}{\textsf{17}}$ | 213 | 49 |
| v3.1 BEiTL-384 | 0.1239 | 0.0667 | 0.2545 | 7.17 | 9.84* | 2.21* | $\color{green}{\textsf{17}}$ | 344 | 13 |
| v3.1 Next-ViTL-384 | 0.1031 | 0.0954 | 0.2295 | 9.21 | 6.89* | 3.47* | $\color{green}{\textsf{16}}$ | 72 | 30 |
| v3.1 BEiTB-384 | 0.1159 | 0.0967 | 0.2901 | 9.88 | 26.60* | 3.91* | $\color{green}{\textsf{-31}}$ | 112 | 31 |
| v3.0 DPTL-384 | 0.1082 | 0.0888 | 0.2697 | 9.97 | 8.46 | 8.32 | $\color{green}{\textsf{0}}$ | 344 | 61 |
| v3.0 DPTH-384 | 0.1106 | 0.0934 | 0.2741 | 10.89 | 11.56 | 8.69 | $\color{green}{\textsf{-10}}$ | 123 | 50 |
| v2.1 Large384 | 0.1295 | 0.1155 | 0.3285 | 12.51 | 16.08 | 8.71 | $\color{green}{\textsf{-32}}$ | 105 | 47 |
| Inference height 256 | | | | | | | | | |
| v3.1 Swin2T-256$\tiny{\square}$ | 0.1211 | 0.1106 | 0.2868 | 13.43 | 10.13* | 5.55* | $\color{green}{\textsf{-11}}$ | 42 | 64 |
| v2.1 Small256 | 0.1344 | 0.1344 | 0.3370 | 14.53 | 29.27 | 13.43 | $\color{green}{\textsf{-76}}$ | 21 | 90 |
| Inference height 224 | | | | | | | | | |
| v3.1 LeViT224$\tiny{\square}$ | 0.1314 | 0.1206 | 0.3148 | 18.21 | 15.27* | 8.64* | $\color{green}{\textsf{-40}}$ | 51 | 73 |

* No zero-shot error, because models are also trained on KITTI and NYU Depth V2
$\square$ Validation performed at square resolution, either because the transformer encoder backbone of a model does not support non-square resolutions (Swin, Swin2, LeViT) or for comparison with these models. All other validations keep the aspect ratio. A difference in resolution limits the comparability of the zero-shot error and the improvement, because these quantities are averages over the pixels of an image and do not take into account the advantage of more details due to a higher resolution.
Best values per column and same validation height in bold

Improvement

The improvement in the above table is defined as the relative reduction in zero-shot error with respect to MiDaS v3.0 DPTL-384, averaged over the datasets. If $\epsilon_d$ is the zero-shot error for dataset $d$, then the $\color{green}{\textsf{improvement}}$ is given by $100(1-(1/6)\sum_d\epsilon_d/\epsilon_{d,\rm{DPT_{L-384}}})$%.
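As a worked check of this formula, the sketch below recomputes the improvement for two rows of the accuracy table. The function name `improvement` and the constant names are chosen here for illustration; the error values are copied from the table.

```python
# Zero-shot errors copied from the accuracy table, in column order:
# DIW WHDR, Eth3d AbsRel, Sintel AbsRel, TUM delta1, KITTI delta1, NYUv2 delta1.
DPT_L_384 = (0.1082, 0.0888, 0.2697, 9.97, 8.46, 8.32)    # v3.0 DPTL-384 (reference)
DPT_H_384 = (0.1106, 0.0934, 0.2741, 10.89, 11.56, 8.69)  # v3.0 DPTH-384
V21_LARGE_384 = (0.1295, 0.1155, 0.3285, 12.51, 16.08, 8.71)  # v2.1 Large384


def improvement(errors, reference=DPT_L_384):
    """Return 100 * (1 - mean over datasets of error / reference error), in percent."""
    if len(errors) != len(reference):
        raise ValueError("errors and reference must cover the same datasets")
    mean_ratio = sum(e / r for e, r in zip(errors, reference)) / len(reference)
    return 100.0 * (1.0 - mean_ratio)


print(round(improvement(DPT_L_384)))      # 0
print(round(improvement(DPT_H_384)))      # -10
print(round(improvement(V21_LARGE_384)))  # -32
```

The rounded results agree with the Imp. column for these three rows. A negative value means the model has a higher average relative error than the reference.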

Note that the improvements of 10% for MiDaS v2.0 → v2.1 and 21% for MiDaS v2.1 → v3.0 are not visible from the improvement column (Imp.) in the table but would require an evaluation with respect to MiDaS v2.1 Large384 and v2.0 Large384 respectively instead of v3.0 DPTL-384.

Depth map comparison

Zoom in for better visibility

Speed on Camera Feed

Test configuration

  • Windows 10
  • 11th Gen Intel Core i7-1185G7 3.00GHz
  • 16GB RAM
  • Camera resolution 640x480
  • openvino_midas_v21_small_256

Speed: 22 FPS

Changelog

  • [Dec 2022] Released MiDaS v3.1:
    • New models based on 5 different types of transformers (BEiT, Swin2, Swin, Next-ViT, LeViT)
    • Training datasets extended from 10 to 12, including also KITTI and NYU Depth V2 using BTS split
    • Best model, BEiTLarge 512, with resolution 512x512, is on average about 28% more accurate than MiDaS v3.0
    • Integrated live depth estimation from camera feed
  • [Sep 2021] Integrated into Hugging Face Spaces with Gradio. See Gradio Web Demo.
  • [Apr 2021] Released MiDaS v3.0:
  • [Nov 2020] Released MiDaS v2.1:
  • [Jul 2020] Added TensorFlow and ONNX code. Added online demo.
  • [Dec 2019] Released new version of MiDaS - the new model is significantly more accurate and robust
  • [Jul 2019] Initial release of MiDaS (Link)

Citation

Please cite our paper if you use this code or any of the models:

@ARTICLE {Ranftl2022,
    author  = "Ren\'{e} Ranftl and Katrin Lasinger and David Hafner and Konrad Schindler and Vladlen Koltun",
    title   = "Towards Robust Monocular Depth Estimation: Mixing Datasets for Zero-Shot Cross-Dataset Transfer",
    journal = "IEEE Transactions on Pattern Analysis and Machine Intelligence",
    year    = "2022",
    volume  = "44",
    number  = "3"
}

If you use a DPT-based model, please also cite:

@article{Ranftl2021,
	author    = {Ren\'{e} Ranftl and Alexey Bochkovskiy and Vladlen Koltun},
	title     = {Vision Transformers for Dense Prediction},
	journal   = {ICCV},
	year      = {2021},
}

Acknowledgements

Our work builds on and uses code from timm and Next-ViT. We'd like to thank the authors for making these libraries available.

License

MIT License
