🚀 MetricDepth (ICCV23) 🚀
This is the official PyTorch implementation of the paper "Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image" (Metric 3D).
Authors: Wei Yin1*, Chi Zhang2*, Hao Chen3, Zhipeng Cai3, Gang Yu4, Kaixuan Wang1, Xiaozhi Chen1, Chunhua Shen3
Arxiv | Video | Hugging Face 🤗 (Coming Soon)
@JUGGHM1,5 will also maintain this project.
2nd Monocular Depth Estimation Challenge in CVPR 2023
The Champion of Zero-shot Testing on NYU and KITTI, comparable with SoTA supervised methods
News and TO DO LIST
- Stronger models and tiny models
- Hugging Face
- [2023/8/10] Inference code, pretrained weights, and demo released.
🌼 Abstract
Existing monocular metric depth estimation methods can only handle a single camera model and are unable to perform mixed-data training due to the metric ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. In this work, we show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. Equipped with our module, monocular models can be stably trained with over 8 million images captured by thousands of camera models, resulting in zero-shot generalization to in-the-wild images with unseen camera settings.
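For intuition, below is a minimal sketch of the label-scaling flavor of such a canonical-space transformation. The constant, function names, and the choice of rescaling depth labels rather than images are illustrative assumptions; refer to the paper for the exact CSTM formulations.

```python
import numpy as np

CANONICAL_FOCAL = 1000.0  # assumed focal length of the canonical camera


def to_canonical_depth(metric_depth: np.ndarray, focal: float) -> np.ndarray:
    # Illustrative: map ground-truth metric depth into the canonical camera
    # space before training, so images from cameras with different focal
    # lengths become comparable supervision signals.
    return metric_depth * (CANONICAL_FOCAL / focal)


def to_metric_depth(canonical_depth: np.ndarray, focal: float) -> np.ndarray:
    # Illustrative inverse at inference time: convert the network's canonical
    # prediction back to metric depth using the real camera's focal length.
    return canonical_depth * (focal / CANONICAL_FOCAL)
```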
🎩 Fully zero-shot state-of-the-art mono-depth
2nd Monocular Depth Estimation Challenge in CVPR 2023
Highlights: The Champion 🏆 of zero-shot testing on the NYU and KITTI benchmarks
WITHOUT re-training on the target datasets, we obtain performance comparable to the SoTA supervised methods Adabins and NewCRFs.
| Method | Backbone | KITTI δ1 ↑ | KITTI δ2 ↑ | KITTI δ3 ↑ | KITTI AbsRel ↓ | KITTI RMSE ↓ | KITTI RMSE_log ↓ | NYU δ1 ↑ | NYU δ2 ↑ | NYU δ3 ↑ | NYU AbsRel ↓ | NYU log10 ↓ | NYU RMSE ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adabins | Efficient-B5 | 0.964 | 0.995 | 0.999 | 0.058 | 2.360 | 0.088 | 0.903 | 0.984 | 0.997 | 0.103 | 0.0444 | 0.364 |
| NewCRFs | SwinT-L | 0.974 | 0.997 | 0.999 | 0.052 | 2.129 | 0.079 | 0.922 | 0.983 | 0.994 | 0.095 | 0.041 | 0.334 |
| Ours (CSTM_label) | ConvNeXt-L | 0.964 | 0.993 | 0.998 | 0.058 | 2.770 | 0.092 | 0.944 | 0.986 | 0.995 | 0.083 | 0.035 | 0.310 |
🌈 DEMOs
In-the-wild 3D reconstruction
| Image | Reconstruction | Pointcloud File |
|---|---|---|
| room | | Download |
| Colosseum | | Download |
| chess | | Download |
All three images are downloaded from Unsplash and placed in the data/wild_demo directory.
3D metric reconstruction, Metric3D × DroidSLAM
Metric3D can also provide scale information for DroidSLAM, helping to solve the scale-drift problem and yielding better trajectories. (Left: Droid-SLAM (mono). Right: Droid-SLAM with Metric-3D)
Bird's-Eye View (Left: Droid-SLAM (mono). Right: Droid-SLAM with Metric-3D)
Front View
KITTI odometry evaluation (translational RMS drift (t_rel, ↓) / rotational RMS drift (r_rel, ↓))
| Method | Modality | seq 00 | seq 02 | seq 05 | seq 06 | seq 08 | seq 09 | seq 10 |
|---|---|---|---|---|---|---|---|---|
| ORB-SLAM2 | Mono | 11.43/0.58 | 10.34/0.26 | 9.04/0.26 | 14.56/0.26 | 11.46/0.28 | 9.3/0.26 | 2.57/0.32 |
| Droid-SLAM | Mono | 33.9/0.29 | 34.88/0.27 | 23.4/0.27 | 17.2/0.26 | 39.6/0.31 | 21.7/0.23 | 7/0.25 |
| Droid+Ours | Mono | 1.44/0.37 | 2.64/0.29 | 1.44/0.25 | 0.6/0.2 | 2.2/0.3 | 1.63/0.22 | 2.73/0.23 |
| ORB-SLAM2 | Stereo | 0.88/0.31 | 0.77/0.28 | 0.62/0.26 | 0.89/0.27 | 1.03/0.31 | 0.86/0.25 | 0.62/0.29 |
Metric3D makes the mono-SLAM scale-aware, like stereo systems.
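The actual coupling is implemented in the SLAM scripts of this repo; purely as an illustration of the idea (not the repo's integration code), a metric depth map can anchor an up-to-scale mono-SLAM depth via a robust per-frame ratio:

```python
import numpy as np


def estimate_scale(slam_depth: np.ndarray, metric_depth: np.ndarray) -> float:
    # Illustration only, not the repo's Droid-SLAM integration: compute one
    # scale factor that maps the up-to-scale SLAM depth onto Metric3D's
    # metric depth, using a median ratio for robustness to outliers.
    valid = (slam_depth > 0) & (metric_depth > 0)
    return float(np.median(metric_depth[valid] / slam_depth[valid]))
```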
KITTI sequence videos - YouTube
2011_09_30_drive_0028 / 2011_09_30_drive_0033 / 2011_09_30_drive_0034
videos - Bilibili (TODO)
Estimated pose
2011_09_30_drive_0033 / 2011_09_30_drive_0034 / 2011_10_03_drive_0042
Pointcloud files
2011_09_30_drive_0033 / 2011_09_30_drive_0034 / 2011_10_03_drive_0042
🔨 Installation
One-line Installation
```bash
pip install -r requirements.txt
```
Or you could also try:
For 30-series GPUs with PyTorch 1.10:

```bash
conda create -n metric3d python=3.7
conda activate metric3d
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
pip install -U openmim
mim install mmengine
mim install "mmcv-full==1.3.17"
pip install "mmsegmentation==0.19.0"
```
For 40-series GPUs with PyTorch 2.0:

```bash
conda create -n metric3d python=3.8
conda activate metric3d
pip3 install torch torchvision torchaudio
pip install -r requirements.txt
pip install -U openmim
mim install mmengine
mim install "mmcv-full==1.7.1"
pip install "mmsegmentation==0.30.0"
pip install numpy==1.20.0
pip install scikit-image==0.18.0
```
dataset annotation components
For off-the-shelf depth datasets, we need to generate JSON annotations compatible with this codebase, organized as follows:
```python
{
    'files': [
        {
            'rgb': 'data/kitti_demo/rgb/xxx.png',
            'depth': 'data/kitti_demo/depth/xxx.png',
            'depth_scale': 1000.0,  # the scale factor of the GT depth image
            'cam_in': [fx, fy, cx, cy],
        },
        {
            ...
        },
        ...
    ]
}
```
To generate such annotations, please refer to the "Inference" section.
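For reference, a minimal sketch of writing such an annotation file is shown below; the directory names, intrinsics, output path, and depth scale are placeholders, and `data/gene_annos_kitti_demo.py` remains the authoritative script:

```python
import json
import os

rgb_dir = 'data/kitti_demo/rgb'      # placeholder paths
depth_dir = 'data/kitti_demo/depth'
fx, fy, cx, cy = 707.1, 707.1, 601.9, 183.1  # placeholder intrinsics

files = []
for name in sorted(os.listdir(rgb_dir)):
    files.append({
        'rgb': os.path.join(rgb_dir, name),
        'depth': os.path.join(depth_dir, name),  # optional
        'depth_scale': 1000.0,                   # scale of the GT depth PNGs
        'cam_in': [fx, fy, cx, cy],
    })

with open('data/kitti_demo/test_annotations.json', 'w') as f:  # placeholder output path
    json.dump({'files': files}, f)
```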
configs
In `mono/configs` we provide different config setups.
The intrinsics of the canonical camera are set as below:
```python
canonical_space = dict(
    img_size=(512, 960),
    focal_length=1000.0,
),
```
where cx and cy are set to half of the image width and height, respectively.
Inference settings are defined as:

```python
depth_range=(0, 1),
depth_normalize=(0.3, 150),
crop_size = (512, 1088),
```

where input images are first resized to the `crop_size` and then fed into the model.
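Keep in mind that resizing changes the effective intrinsics: if you adapt the pipeline to your own data, the focal lengths and principal point should be scaled by the same ratios as the image. This is a generic geometry sketch, not code taken from this repo:

```python
def rescale_intrinsics(cam_in, orig_hw, new_hw):
    # Generic sketch (not from this repo): scale [fx, fy, cx, cy] when an
    # image of size orig_hw = (H, W) is resized to new_hw = (H', W').
    fx, fy, cx, cy = cam_in
    sy = new_hw[0] / orig_hw[0]
    sx = new_hw[1] / orig_hw[1]
    return [fx * sx, fy * sy, cx * sx, cy * sy]
```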
✈️ Inference
Download Checkpoint
| Version | Encoder | Decoder | Link |
|---|---|---|---|
| v1.0 | ConvNeXt-L | Hourglass-Decoder | Download |
More models are on the way...
Dataset Mode
- Put the trained checkpoint file `model.pth` in `weight/`.
- Generate the data annotation by following the code in `data/gene_annos_kitti_demo.py`, which includes 'rgb', (optional) 'intrinsic', (optional) 'depth', and (optional) 'depth_scale'.
- Change the 'test_data_path' in `test_*.sh` to the `*.json` path.
- Run `source test_kitti.sh` or `source test_nyu.sh`.
In-the-Wild Mode
- Put the trained checkpoint file `model.pth` in `weight/`.
- Change the 'test_data_path' in `test.sh` to the image folder path.
- Run `source test.sh`.

As no intrinsics are provided, we use 9 default settings of the focal length.
❓ Q & A
Q1: Why do the depth maps look good while the point clouds are distorted?
Because the focal length is not set properly! Please find a proper focal length by modifying the code here yourself.
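To see why, here is a standard pinhole back-projection sketch (generic geometry, not the repo's exact code): a wrong fx/fy scales the lateral x/y coordinates relative to depth, so the depth map can look fine while the point cloud comes out stretched or squashed.

```python
import numpy as np


def depth_to_pointcloud(depth: np.ndarray, fx: float, fy: float,
                        cx: float, cy: float) -> np.ndarray:
    # Standard pinhole back-projection (generic sketch, not this repo's code).
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```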
Q2: Why are the point clouds so slow to generate?
Because the images are too large! Use smaller ones instead.
Q3: Why are the predicted depth maps not satisfactory?
First make sure all black padding regions at the image boundaries are cropped out, then try again. Besides, Metric3D is not almighty: some objects (chandeliers, drones, ...) and camera views (aerial view, BEV, ...) do not occur frequently in the training datasets. We will dig deeper into this and release more powerful solutions.
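A generic way to strip such black padding before inference (an illustrative sketch assuming the padding is near-zero on all channels, not part of this repo):

```python
import numpy as np


def crop_black_borders(img: np.ndarray, thresh: int = 5) -> np.ndarray:
    # Illustrative sketch: drop boundary rows/columns that are (almost)
    # entirely black, e.g. padding left over from rectification.
    mask = img.max(axis=-1) > thresh if img.ndim == 3 else img > thresh
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0 or cols.size == 0:
        return img
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```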
🍭 Acknowledgement
This work is empowered by DJI Automotive1 and collaborators from Tencent2, ZJU3, Intel Labs4, and HKUST5.
We appreciate the efforts of the contributors to mmcv, all the datasets involved, and NVDS.
📧 Citation
```bibtex
@inproceedings{yin2023metric,
  title={Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image},
  author={Yin, Wei and Zhang, Chi and Chen, Hao and Cai, Zhipeng and Yu, Gang and Wang, Kaixuan and Chen, Xiaozhi and Shen, Chunhua},
  booktitle={ICCV},
  year={2023}
}
```
License and Contact
The Metric 3D code is under a GPLv3 License for non-commercial usage. For further questions, contact Dr. Wei Yin [[email protected]] and Mr. Mu Hu [[email protected]].