[CVPR 2021] Self-supervised depth estimation from short sequences

The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth

Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel J. Brostow and Michael Firman – CVPR 2021

[Link to paper]

We introduce ManyDepth, an adaptive approach to dense depth estimation that can make use of sequence information at test time, when it is available.

  • ✅ Self-supervised: We train from monocular video only. No depths or poses are needed at training or test time.
  • ✅ Good depths from single frames; even better depths from short sequences.
  • ✅ Efficient: Only one forward pass at test time. No test-time optimization needed.
  • ✅ State-of-the-art self-supervised monocular-trained depth estimation on KITTI and CityScapes.

5-minute CVPR presentation video link

Overview

Cost volumes are commonly used for estimating depths from multiple input views:

Cost volume used for aggregating sequences of frames
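
To make the idea concrete, here is a minimal sketch of a plane-sweep cost volume in PyTorch. It is illustrative only: warp_src_to_target is a hypothetical stand-in for the pose- and depth-dependent warping that a real implementation performs.

import torch

def build_cost_volume(target_feats, source_feats, depth_bins, warp_src_to_target):
    # target_feats, source_feats: [B, C, H, W] feature maps.
    # depth_bins: iterable of candidate depth values.
    # warp_src_to_target(feats, depth): hypothetical helper that warps source
    # features into the target view assuming the scene lies at `depth`.
    costs = []
    for depth in depth_bins:
        warped = warp_src_to_target(source_feats, depth)
        # Per-pixel matching cost: mean absolute feature difference; low cost
        # means this depth hypothesis explains the pixel well.
        costs.append((warped - target_feats).abs().mean(dim=1))
    return torch.stack(costs, dim=1)  # [B, num_bins, H, W]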

However, cost volumes do not easily work with self-supervised training.

Baseline: Depth from cost volume input without our contributions

In our paper, we:

  • Introduce an adaptive cost volume to deal with unknown scene scales (see the sketch after this list)
  • Fix problems with moving objects
  • Introduce augmentations to deal with static cameras and start-of-sequence frames
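
As a rough illustration of the adaptive cost volume, the depth range covered by the bins can be re-estimated from the network's own recent predictions rather than fixed in advance. The function below is a hedged sketch: the names and the moving-average scheme are illustrative, not taken from the ManyDepth code.

import torch

def adaptive_depth_bins(predicted_depths, num_bins=96, momentum=0.99,
                        running_min=None, running_max=None):
    # predicted_depths: a batch of recent depth predictions, any shape.
    batch_min = predicted_depths.min().item()
    batch_max = predicted_depths.max().item()
    # Smooth the range across batches so the bin edges don't jump around.
    running_min = batch_min if running_min is None else (
        momentum * running_min + (1 - momentum) * batch_min)
    running_max = batch_max if running_max is None else (
        momentum * running_max + (1 - momentum) * batch_max)
    bins = torch.linspace(running_min, running_max, num_bins)
    return bins, running_min, running_max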

These contributions enable cost volumes to work with self-supervised training:

ManyDepth: Depth from cost volume input with our contributions

With our contributions, short test-time sequences give better predictions than methods which predict depth from just a single frame.

ManyDepth vs Monodepth2 depths and error maps

✏️ 📄 Citation

If you find our work useful or interesting, please cite our paper:

@inproceedings{watson2021temporal,
    author = {Jamie Watson and
              Oisin Mac Aodha and
              Victor Prisacariu and
              Gabriel Brostow and
              Michael Firman},
    title = {{The Temporal Opportunist: Self-Supervised Multi-Frame Monocular Depth}},
    booktitle = {Computer Vision and Pattern Recognition (CVPR)},
    year = {2021}
}

📈 Results

Our ManyDepth method outperforms all previous methods across most metrics, in every subsection of the results table, whether or not the baselines use multiple frames at test time. See our paper for full details.

KITTI results table

👀 Reproducing Paper Results

To recreate the results from our paper, run:

CUDA_VISIBLE_DEVICES=<your_desired_GPU> \
python -m manydepth.train \
    --data_path <your_KITTI_path> \
    --log_dir <your_save_path>  \
    --model_name <your_model_name>

Depending on the size of your GPU, you may need to set --batch_size lower than 12. Additionally, you can train a high-resolution model by adding --height 320 --width 1024.

For instructions on downloading the KITTI dataset, see Monodepth2.

To train a CityScapes model, run:

CUDA_VISIBLE_DEVICES=<your_desired_GPU> \
python -m manydepth.train \
    --data_path <your_preprocessed_cityscapes_path> \
    --log_dir <your_save_path>  \
    --model_name <your_model_name> \
    --dataset cityscapes_preprocessed \
    --split cityscapes_preprocessed \
    --freeze_teacher_epoch 5 \
    --height 192 --width 512

Note the --freeze_teacher_epoch 5 flag here; we found this to be important for Cityscapes models due to the large number of images in the training set.

This assumes you have already preprocessed the CityScapes dataset using SfMLearner's prepare_train_data.py script. We used the following command:

python prepare_train_data.py \
    --img_height 512 \
    --img_width 1024 \
    --dataset_dir <path_to_downloaded_cityscapes_data> \
    --dataset_name cityscapes \
    --dump_root <your_preprocessed_cityscapes_path> \
    --seq_length 3 \
    --num_threads 8

Note that while we use the --img_height 512 flag, the prepare_train_data.py script will save images that are 1024x384, since it also crops off the bottom portion of the image. You could probably save disk space without a loss of accuracy by preprocessing with --img_height 256 --img_width 512 (to create 512x192 images), but this isn't what we did for our experiments.

💾 Pretrained weights and evaluation

You can download weights for some pretrained models here:

To evaluate a model on KITTI, run:

CUDA_VISIBLE_DEVICES=<your_desired_GPU> \
python -m manydepth.evaluate_depth \
    --data_path <your_KITTI_path> \
    --load_weights_folder <your_model_path> \
    --eval_mono

Make sure you have first run export_gt_depth.py to extract ground truth files.

And to evaluate a model on Cityscapes, run:

CUDA_VISIBLE_DEVICES=<your_desired_GPU> \
python -m manydepth.evaluate_depth \
    --data_path <your_cityscapes_path> \
    --load_weights_folder <your_model_path> \
    --eval_mono \
    --eval_split cityscapes

During evaluation, we crop and evaluate on the middle 50% of the images.
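
A minimal sketch of such a crop, interpreting "the middle 50%" as the central half of each image along both axes (an assumption; the exact bounds used in evaluate_depth.py may differ):

def center_crop(pred, gt):
    # pred, gt: arrays of shape [..., H, W]; keep the central half per axis.
    h, w = gt.shape[-2:]
    top, left = h // 4, w // 4
    return (pred[..., top:h - top, left:w - left],
            gt[..., top:h - top, left:w - left])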

We provide ground truth depth files HERE, which were converted from pixel disparities using intrinsics and the known baseline. Download this and unzip into splits/cityscapes.

If you want to evaluate a teacher network (i.e. the monocular network used for consistency loss), then add the flag --eval_teacher. This will load the weights of mono_encoder.pth and mono_depth.pth, which are provided for our KITTI models.

🖼 Running on your own images

We provide some sample code in test_simple.py which demonstrates multi-frame inference. This predicts depth for a sequence of two images cropped from a dashcam video. Prediction also requires an estimate of the intrinsics matrix, in JSON format. For the provided test images, we have estimated the intrinsics to be equivalent to those of the KITTI dataset. Note that the intrinsics provided in the JSON file are expected to be in normalised coordinates.
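
If you build an intrinsics file for your own camera, "normalised coordinates" means focal lengths and principal point divided through by the image width and height. A hedged example of producing such a file (the exact JSON schema is an assumption; compare with the provided assets/test_sequence_intrinsics.json):

import json

# Pixel-space intrinsics (illustrative KITTI-like values) and the image size
# at which they were measured.
fx, fy, cx, cy = 718.9, 718.9, 607.1, 185.2
width, height = 1242, 375

# Normalise: divide x-axis quantities by width, y-axis quantities by height.
K = [[fx / width, 0.0,         cx / width,  0.0],
     [0.0,        fy / height, cy / height, 0.0],
     [0.0,        0.0,         1.0,         0.0],
     [0.0,        0.0,         0.0,         1.0]]

with open("my_intrinsics.json", "w") as f:
    json.dump(K, f)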

Download and unzip model weights from one of the links above, and then run the following command:

python -m manydepth.test_simple \
    --target_image_path assets/test_sequence_target.jpg \
    --source_image_path assets/test_sequence_source.jpg \
    --intrinsics_json_path assets/test_sequence_intrinsics.json \
    --model_path path/to/weights

A predicted depth map rendering will be saved to assets/test_sequence_target_disp.jpeg.
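
The saved .jpeg is a colormapped rendering for visualisation. If you adapt the script to work with the raw network output, the usual Monodepth2-style conversion from sigmoid disparity to depth looks like the sketch below (the min/max depth values are the conventional defaults, an assumption here):

def disp_to_depth(disp, min_depth=0.1, max_depth=100.0):
    # Map the network's [0, 1] sigmoid disparity into [1/max_depth, 1/min_depth],
    # then invert to obtain depth.
    min_disp, max_disp = 1.0 / max_depth, 1.0 / min_depth
    scaled_disp = min_disp + (max_disp - min_disp) * disp
    return 1.0 / scaled_disp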

πŸ‘©β€βš–οΈ License

Copyright Β© Niantic, Inc. 2021. Patent Pending. All rights reserved. Please see the license file for terms.
