🚀 MetricDepth (ICCV23) 🚀
This is the official PyTorch implementation of the paper "Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image" (Metric 3D).
Authors: Wei Yin1*, Chi Zhang2*, Hao Chen3, Zhipeng Cai3, Gang Yu4, Kaixuan Wang1, Xiaozhi Chen1, Chunhua Shen3
Arxiv | Video | Hugging Face 🤗 (Coming Soon)
@JUGGHM1,5 will also maintain this project.
2nd Monocular Depth Estimation Challenge in CVPR 2023
The Champion of Zero-shot Testing on NYU and KITTI, comparable with SoTA supervised methods
News and TO DO LIST
- Stronger models and tiny models
- Hugging Face
- [2023/8/10] Inference code, pretrained weights, and demo released.
🌼 Abstract
Existing monocular metric depth estimation methods can only handle a single camera model and are unable to perform mixed-data training due to the metric ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. In this work, we show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. Equipped with our module, monocular models can be stably trained with over 8 million images captured by thousands of camera models, resulting in zero-shot generalization to in-the-wild images with unseen camera settings.
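For intuition, below is a minimal sketch of the label-scaling flavor of such a canonical-space transformation. The constant, function names, and the choice of rescaling depth labels rather than images are illustrative assumptions; refer to the paper for the exact CSTM formulations.

```python
import numpy as np

CANONICAL_FOCAL = 1000.0  # assumed focal length of the canonical camera


def to_canonical_depth(metric_depth: np.ndarray, focal: float) -> np.ndarray:
    # Illustrative: map ground-truth metric depth into the canonical camera
    # space before training, so images from cameras with different focal
    # lengths become comparable supervision signals.
    return metric_depth * (CANONICAL_FOCAL / focal)


def to_metric_depth(canonical_depth: np.ndarray, focal: float) -> np.ndarray:
    # Illustrative inverse at inference time: convert the network's canonical
    # prediction back to metric depth using the real camera's focal length.
    return canonical_depth * (focal / CANONICAL_FOCAL)
```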
🎩 Fully zero-shot state-of-the-art mono-depth
2nd Monocular Depth Estimation Challenge in CVPR 2023
Highlights: The Champion 🏆 of zero-shot testing on the NYU and KITTI benchmarks
WITHOUT re-training on the target datasets, we obtain performance comparable to the SoTA supervised methods Adabins and NewCRFs.
| Method | Backbone | KITTI δ1 ↑ | KITTI δ2 ↑ | KITTI δ3 ↑ | KITTI AbsRel ↓ | KITTI RMSE ↓ | KITTI RMSE_log ↓ | NYU δ1 ↑ | NYU δ2 ↑ | NYU δ3 ↑ | NYU AbsRel ↓ | NYU log10 ↓ | NYU RMSE ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adabins | Efficient-B5 | 0.964 | 0.995 | 0.999 | 0.058 | 2.360 | 0.088 | 0.903 | 0.984 | 0.997 | 0.103 | 0.0444 | 0.364 |
| NewCRFs | SwinT-L | 0.974 | 0.997 | 0.999 | 0.052 | 2.129 | 0.079 | 0.922 | 0.983 | 0.994 | 0.095 | 0.041 | 0.334 |
| Ours (CSTM_label) | ConvNeXt-L | 0.964 | 0.993 | 0.998 | 0.058 | 2.770 | 0.092 | 0.944 | 0.986 | 0.995 | 0.083 | 0.035 | 0.310 |
🌈 DEMOs
In-the-wild 3D reconstruction
| Image | Reconstruction | Pointcloud File |
|---|---|---|
| room | | Download |
| Colosseum | | Download |
| chess | | Download |
All three images are downloaded from Unsplash and placed in the data/wild_demo directory.
3D metric reconstruction, Metric3D × DroidSLAM
Metric3D can also provide scale information for DroidSLAM, helping to solve the scale-drift problem and yielding better trajectories. (Left: Droid-SLAM (mono). Right: Droid-SLAM with Metric-3D)
Bird's-Eye View (Left: Droid-SLAM (mono). Right: Droid-SLAM with Metric-3D)
Front View
KITTI odometry evaluation (translational RMS drift (t_rel, ↓) / rotational RMS drift (r_rel, ↓))
| Method | Modality | seq 00 | seq 02 | seq 05 | seq 06 | seq 08 | seq 09 | seq 10 |
|---|---|---|---|---|---|---|---|---|
| ORB-SLAM2 | Mono | 11.43/0.58 | 10.34/0.26 | 9.04/0.26 | 14.56/0.26 | 11.46/0.28 | 9.3/0.26 | 2.57/0.32 |
| Droid-SLAM | Mono | 33.9/0.29 | 34.88/0.27 | 23.4/0.27 | 17.2/0.26 | 39.6/0.31 | 21.7/0.23 | 7/0.25 |
| Droid+Ours | Mono | 1.44/0.37 | 2.64/0.29 | 1.44/0.25 | 0.6/0.2 | 2.2/0.3 | 1.63/0.22 | 2.73/0.23 |
| ORB-SLAM2 | Stereo | 0.88/0.31 | 0.77/0.28 | 0.62/0.26 | 0.89/0.27 | 1.03/0.31 | 0.86/0.25 | 0.62/0.29 |
Metric3D makes the mono-SLAM scale-aware, like stereo systems.
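The actual coupling is implemented in the SLAM scripts of this repo; purely as an illustration of the idea (not the repo's integration code), a metric depth map can anchor an up-to-scale mono-SLAM depth via a robust per-frame ratio:

```python
import numpy as np


def estimate_scale(slam_depth: np.ndarray, metric_depth: np.ndarray) -> float:
    # Illustration only, not the repo's Droid-SLAM integration: compute one
    # scale factor that maps the up-to-scale SLAM depth onto Metric3D's
    # metric depth, using a median ratio for robustness to outliers.
    valid = (slam_depth > 0) & (metric_depth > 0)
    return float(np.median(metric_depth[valid] / slam_depth[valid]))
```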
KITTI sequence videos - YouTube
2011_09_30_drive_0028 / 2011_09_30_drive_0033 / 2011_09_30_drive_0034
videos - Bilibili (TODO)
Estimated pose
2011_09_30_drive_0033 / 2011_09_30_drive_0034 / 2011_10_03_drive_0042
Pointcloud files
2011_09_30_drive_0033 / 2011_09_30_drive_0034 / 2011_10_03_drive_0042
🔨 Installation
One-line Installation
```bash
pip install -r requirements.txt
```
Or you could also try:
For 30-series GPUs with PyTorch 1.10:

```bash
conda create -n metric3d python=3.7
conda activate metric3d
pip install torch==1.10.0+cu111 torchvision==0.11.0+cu111 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
pip install -U openmim
mim install mmengine
mim install "mmcv-full==1.3.17"
pip install "mmsegmentation==0.19.0"
```
For 40-series GPUs with PyTorch 2.0:

```bash
conda create -n metric3d python=3.8
conda activate metric3d
pip3 install torch torchvision torchaudio
pip install -r requirements.txt
pip install -U openmim
mim install mmengine
mim install "mmcv-full==1.7.1"
pip install "mmsegmentation==0.30.0"
pip install numpy==1.20.0
pip install scikit-image==0.18.0
```
dataset annotation components
For off-the-shelf depth datasets, we need to generate JSON annotations compatible with this codebase, organized as follows:
```python
{
    'files': [
        {
            'rgb': 'data/kitti_demo/rgb/xxx.png',
            'depth': 'data/kitti_demo/depth/xxx.png',
            'depth_scale': 1000.0,  # the scale factor of the GT depth image
            'cam_in': [fx, fy, cx, cy],
        },
        {
            ...
        },
        ...
    ]
}
```
To generate such annotations, please refer to the "Inference" section.
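For reference, a minimal sketch of writing such an annotation file is shown below; the directory names, intrinsics, output path, and depth scale are placeholders, and `data/gene_annos_kitti_demo.py` remains the authoritative script:

```python
import json
import os

rgb_dir = 'data/kitti_demo/rgb'      # placeholder paths
depth_dir = 'data/kitti_demo/depth'
fx, fy, cx, cy = 707.1, 707.1, 601.9, 183.1  # placeholder intrinsics

files = []
for name in sorted(os.listdir(rgb_dir)):
    files.append({
        'rgb': os.path.join(rgb_dir, name),
        'depth': os.path.join(depth_dir, name),  # optional
        'depth_scale': 1000.0,                   # scale of the GT depth PNGs
        'cam_in': [fx, fy, cx, cy],
    })

with open('data/kitti_demo/test_annotations.json', 'w') as f:  # placeholder output path
    json.dump({'files': files}, f)
```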
configs
In `mono/configs` we provide different config setups.
The intrinsics of the canonical camera are set as below:
```python
canonical_space = dict(
    img_size=(512, 960),
    focal_length=1000.0,
),
```
where cx and cy are set to half of the image width and height, respectively.
Inference settings are defined as:

```python
depth_range=(0, 1),
depth_normalize=(0.3, 150),
crop_size = (512, 1088),
```

where input images are first resized to the `crop_size` and then fed into the model.
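Keep in mind that resizing changes the effective intrinsics: if you adapt the pipeline to your own data, the focal lengths and principal point should be scaled by the same ratios as the image. This is a generic geometry sketch, not code taken from this repo:

```python
def rescale_intrinsics(cam_in, orig_hw, new_hw):
    # Generic sketch (not from this repo): scale [fx, fy, cx, cy] when an
    # image of size orig_hw = (H, W) is resized to new_hw = (H', W').
    fx, fy, cx, cy = cam_in
    sy = new_hw[0] / orig_hw[0]
    sx = new_hw[1] / orig_hw[1]
    return [fx * sx, fy * sy, cx * sx, cy * sy]
```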
✈️ Inference
Download Checkpoint
| Version | Encoder | Decoder | Link |
|---|---|---|---|
| v1.0 | ConvNeXt-L | Hourglass-Decoder | Download |
More models are on the way...
Dataset Mode
- Put the trained checkpoint file `model.pth` in `weight/`.
- Generate the data annotation by following the code in `data/gene_annos_kitti_demo.py`, which includes 'rgb', (optional) 'intrinsic', (optional) 'depth', and (optional) 'depth_scale'.
- Change the 'test_data_path' in `test_*.sh` to the `*.json` path.
- Run `source test_kitti.sh` or `source test_nyu.sh`.
In-the-Wild Mode
- Put the trained checkpoint file `model.pth` in `weight/`.
- Change the 'test_data_path' in `test.sh` to the image folder path.
- Run `source test.sh`.

As no intrinsics are provided, we use 9 default settings of the focal length.
❓ Q & A
Q1: Why do the depth maps look good while the point clouds are distorted?
Because the focal length is not set properly! Please find a proper focal length by modifying the code here yourself.
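To see why, here is a standard pinhole back-projection sketch (generic geometry, not the repo's exact code): a wrong fx/fy scales the lateral x/y coordinates relative to depth, so the depth map can look fine while the point cloud comes out stretched or squashed.

```python
import numpy as np


def depth_to_pointcloud(depth: np.ndarray, fx: float, fy: float,
                        cx: float, cy: float) -> np.ndarray:
    # Standard pinhole back-projection (generic sketch, not this repo's code).
    h, w = depth.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)
```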
Q2: Why are the point clouds so slow to generate?
Because the images are too large! Use smaller ones instead.
Q3: Why are the predicted depth maps not satisfactory?
First make sure all black padding regions at the image boundaries are cropped out, then try again. Besides, Metric3D is not almighty: some objects (chandeliers, drones, ...) and camera views (aerial view, BEV, ...) do not occur frequently in the training datasets. We will dig deeper into this and release more powerful solutions.
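A generic way to strip such black padding before inference (an illustrative sketch assuming the padding is near-zero on all channels, not part of this repo):

```python
import numpy as np


def crop_black_borders(img: np.ndarray, thresh: int = 5) -> np.ndarray:
    # Illustrative sketch: drop boundary rows/columns that are (almost)
    # entirely black, e.g. padding left over from rectification.
    mask = img.max(axis=-1) > thresh if img.ndim == 3 else img > thresh
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0 or cols.size == 0:
        return img
    return img[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```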
🍭 Acknowledgement
This work is empowered by DJI Automotive1 and collaborators from Tencent2, ZJU3, Intel Labs4, and HKUST5.
We appreciate the efforts of the contributors to mmcv, all the datasets involved, and NVDS.
📧 Citation
```bibtex
@inproceedings{yin2023metric,
  title={Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image},
  author={Yin, Wei and Zhang, Chi and Chen, Hao and Cai, Zhipeng and Yu, Gang and Wang, Kaixuan and Chen, Xiaozhi and Shen, Chunhua},
  booktitle={ICCV},
  year={2023}
}
```
License and Contact
The Metric 3D code is under a GPLv3 License for non-commercial usage. For further questions, contact Dr. Wei Yin [[email protected]] and Mr. Mu Hu [[email protected]].