• Stars
    star
    560
  • Rank 79,541 (Top 2 %)
  • Language
    Python
  • License
    Other
  • Created over 2 years ago
  • Updated 8 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

TRI-VIDAR: TRI's Depth Estimation Repository

Installation | Configuration | Datasets | Visualization | Publications | License

Official PyTorch repository for TRI's latest published depth estimation works. Our goal is to provide a clean environment to reproduce our results and facilitate further research in this field. This repository is an updated version of PackNet-SfM, our previous monocular depth estimation repository, featuring a different license.

Models

(Experimental) For convenient inference, we provide a growing list of our models (PackNet, DeFiNe) model over torchhub without installation.

PackNet

PackNet is a self-supervised monocular depth estimation model, to load a model trained on KITTI and run inference on an RGB image:

import torch
packnet_model = torch.hub.load("TRI-ML/vidar", "PackNet", pretrained=True, trust_repo=True)
rgb_image = # 13HW torch.tensor
depth_pred = model(rgb_image)

DeFiNe

DeFiNe is a multi-view depth estimation model, to load a model trained on Scannet and run inference on multiple posed RGB images:

import torch
define_model = torch.hub.load("TRI-ML/vidar", "DeFiNe", pretrained=True, trust_repo=True)
frames = {} 
frames["rgb"] = # a list of frames as 13HW torch.tensors
frames["intrinsics"] = # a list of 133 torch.tensor intrinsics matrices (one for each image)
frames["pose"] = # a batch of 144 relative poses to reference frame (one will be identity)
depth_preds = define_model(frames) # list of depths, one for each frame

Installation

We recommend using our provided dockerfile (see nvidia-docker2 instructions) to have a reproducible environment. To set up the repository, type in a terminal (only tested in Ubuntu 18.04):

git clone --recurse-submodules https://github.com/TRI-ML/vidar.git   # Clone repository with submodules
cd vidar                                                             # Move to repository folder
make docker-build                                                    # Build the docker image (recommended) 

To start our docker container, simply type make docker-interactive. From inside the docker, you can run scripts with the following command pattern:

python3 scripts/run.py <config.yaml>         # Single CPU/GPU  
python3 scripts/run_ddp.py <config.yaml>     # Distributed Data Parallel (DDP) multi-GPU

To verify that the environment is set up correctly, you can run a simple overfit test:

# Download a tiny subset of KITTI
mkdir /data/vidar 
curl -s https://tri-ml-public.s3.amazonaws.com/github/vidar/datasets/KITTI_tiny.tar | tar xv -C /data/vidar/
# Inside docker
python3 scripts/run.py configs/overfit/kitti_tiny.yaml

Once training is over (which takes around 1 minute), you should achieve results similar to this:

If you want to use features related to AWS (for dataset access) and WandB (for experiment management), you can create associated accounts and configure your shell with the following environment variables:

export AWS_SECRET_ACCESS_KEY=something    # AWS secret key
export AWS_ACCESS_KEY_ID=something        # AWS access key
export AWS_DEFAULT_REGION=something       # AWS default region
export WANDB_ENTITY=something             # WANDB entity
export WANDB_API_KEY=something            # WANDB API key

Configuration

Configuration files (stored in the configs folder) are the entry points for training and inference. The basic structure of a configuration file is:

wrapper:                            # Training parameters 
    <parameters>
arch:                               # Architecture used
    model:                          # Model file and parameters
        file: <model_file>
        <parameters>
    networks:                       # Networks available to the model
        network1:                   # Network1 file and parameters 
            file: <network1_file>
            <parameters>
        network2:                   # Network2 file and parameters 
            file: <network2_file>
            <parameters>
        ...
    losses:                         # Losses available to the model 
        loss1:                      # Loss1 file and parameters
            file: <loss1_file>      
            <parameters>
        loss2:                      # Loss2 file and parameters
            file: <loss2_file>      
            <parameters>
        ...        
evaluation:                         # Evaluation metrics for different tasks
    evaluation1:                    # Evaluation1 and parameters
        <parameters>
    evaluation2:                    # Evaluation2 and parameters
        <parameters>
    ...
optimizers:                         # Optimizers used to train the networks
    network1:                       # Optimizer for network1 and parameters
        <parameters>
    network2:                       # Optimizer for network2 and parameters
        <parameters>
    ...
datasets:                           # Datasets used
    train:                          # Training dataset and parameters
        <parameters>                
        augmentation:               # Training augmentations and parameters 
            <parameters>
        dataloader:                 # Training dataloader and parameters
            <parameters>
    validation:                     # Validation dataset and parameters
        <parameters>                
        augmentation:               # Validation augmentations and parameters
            <parameters>
        dataloader:                 # Validation dataloader and parameters
            <parameters>

To enable WandB logging, you can set these additional parameters in your configuration file:

wandb:
    folder: /data/vidar/wandb     # Where the wandb run is stored
    entity: your_entity           # Wandb entity
    project: your_project         # Wandb project
    num_validation_logs: X        # Number of visualization logs
    tags: [tag1,tag2,...]         # Wandb tags
    notes: note                   # Wandb notes

To enable checkpoint saving, you can set these additional parameters in your configuration file:

checkpoint:
    folder: /data/vidar/checkpoints       # Local folder to store checkpoints
    save_code: True                       # Save repository folder as well
    keep_top: 5                           # How many checkpoints should be stored
    s3_bucket: s3://path/to/s3/bucket     # [optional] AWS folder to store checkpoints        
    dataset: [0]                          # [optional] Validation dataset index to track
    monitor: [depth|abs_rel_pp_gt(0)_0]   # [optional] Validation metric to track
    mode: [min]                           # [optional] If the metric is minimized (min) or maximized (max)

To facilitate the reutilization of configuration files, we also provide a recipe functionality, that enables parameter sharing. To use a recipe, simply type recipe: <path/to/recipe>|<entry> as an additional parameter, to copy all entries from that recipe onto that section. For example:

wrapper: 
  recipe: wrapper|default

will insert all parameters from section default of configs/recipes/wrapper.yaml onto the wrapper section of the configuration file. Parameters added after the recipe will overwrite those copied over, to facilitate customization.

Datasets

In our provided configuration files, datasets are assumed to be downloaded in /data/vidar/<dataset-name>. For convenience, we provide links to some datasets we commonly use here (all licences still apply):

Dataset Version Labels Splits
KITTI KITTI_raw RGB, Depth, Poses, Intrinsics Train / Validation / Test
KITTI_tiny RGB, Depth, Poses, Intrinsics Train
DDAD DDAD_trainval Depth prediction Train / Validation
DDAD_tiny Depth estimation Train
DDAD_test Depth estimation Test
PD PD_guda Depth prediction Train / Validation
PD_draft Depth estimation Train / Validation
VKITTI2 VKITTI2 Full Virtual KITTI 2 dataset Train
VKITTI2_tiny Tiny version of VKITTI2 Train

Visualization

We also provide tools for dataset and prediction visualization, based on our CamViz library. It is added as a submodule in the externals folder. To use it from inside the docker, run xhost +local: before entering it. To visualize the information contained in different datasets, after it has been processed to be used by our repository, use the following command:

python3 demos/display_datasets/display_datasets.py <dataset>   

Some examples of visualization results you will generate for KITTI and DDAD are shown below (more examples can be found in the demo configuration file demos/display_datasets/config.yaml):

You can move the virtual viewing camera with the mouse, holding the left button to translate, the right button to rotate, and scrolling the wheel to zoom in/out. The up/down arrow keys change between temporal contexts, and the left/right arrow keys change between labels. Pressing SPACE changes between pointcloud color schemes (pixel color or per-camera).

Publications

3D Packing for Self-Supervised Monocular Depth Estimation (CVPR 2020, oral)

Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, Adrien Gaidon

Abstract: Although cameras are ubiquitous, robotic platforms typically rely on active sensors like LiDAR for direct 3D perception. In this work, we propose a novel self-supervised monocular depth estimation method combining geometry with a new deep network, PackNet, learned only from unlabeled monocular videos. Our architecture leverages novel symmetrical packing and unpacking blocks to jointly learn to compress and decompress detail-preserving representations using 3D convolutions. Although self-supervised, our method outperforms other self, semi, and fully supervised methods on the KITTI benchmark. The 3D inductive bias in PackNet enables it to scale with input resolution and number of parameters without overfitting, generalizing better on out-of-domain data such as the NuScenes dataset. Furthermore, it does not require large-scale supervised pretraining on ImageNet and can run in real-time. Finally, we release DDAD (Dense Depth for Automated Driving), a new urban driving dataset with more challenging and accurate depth evaluation, thanks to longer-range and denser ground-truth depth generated from high-density LiDARs mounted on a fleet of self-driving cars operating world-wide.

GT depth Abs.Rel. Sq.Rel. RMSE RMSElog SILog d1.25 d1.252 d1.253
ResNet18 | Self-Supervised | 192x640 | ImageNet β†’ KITTI
Original 0.116 0.811 4.902 0.198 19.259 0.865 0.957 0.981
Improved 0.087 0.471 3.947 0.135 12.879 0.913 0.983 0.996
PackNet | Self-Supervised | 192x640 | KITTI
Original 0.111 0.800 4.576 0.189 18.504 0.880 0.960 0.982
Improved 0.078 0.420 3.485 0.121 11.725 0.931 0.986 0.996
@inproceedings{tri-packnet,
  title = {3D Packing for Self-Supervised Monocular Depth Estimation},
  author = {Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, Adrien Gaidon},
  booktitle = {Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR)}
  year = {2020},
}

Multi-Frame Self-Supervised Depth Estimation with Transformers (CVPR 2022)

Vitor Guizilini, Rares Ambrus, Dian Chen, Sergey Zakharov, Adrien Gaidon

Abstract: Multi-frame depth estimation improves over single-frame approaches by also leveraging geometric relationships between images via feature matching, in addition to learning appearance-based features. In this paper we revisit feature matching for self-supervised monocular depth estimation, and propose a novel transformer architecture for cost volume generation. We use depth-discretized epipolar sampling to select matching candidates, and refine predictions through a series of self- and cross-attention layers. These layers sharpen the matching probability between pixel features, improving over standard similarity metrics prone to ambiguities and local minima. The refined cost volume is decoded into depth estimates, and the whole pipeline is trained end-to-end from videos using only a photometric objective. Experiments on the KITTI and DDAD datasets show that our DepthFormer architecture establishes a new state of the art in self-supervised monocular depth estimation, and is even competitive with highly specialized supervised single-frame architectures. We also show that our learned cross-attention network yields representations transferable across datasets, increasing the effectiveness of pre-training strategies.

GT depth Frames Abs.Rel. Sq.Rel. RMSE RMSElog SILog d1.25 d1.252 d1.253
DepthFormer | Self-Supervised | 192x640 | ImageNet β†’ KITTI
Original Single (t) 0.117 0.876 4.692 0.193 18.940 0.874 0.960 0.981
Multi (t-1,t) 0.090 0.661 4.149 0.175 17.260 0.905 0.963 0.982
Improved Single (t) 0.083 0.464 3.591 0.126 12.156 0.926 0.986 0.996
Multi (t-1,t) 0.055 0.271 2.917 0.095 9.160 0.955 0.991 0.998
@inproceedings{tri-depthformer,
  title = {Multi-Frame Self-Supervised Depth with Transformers},
  author = {Vitor Guizilini, Rares Ambrus, Dian Chen, Sergey Zakharov, Adrien Gaidon},
  booktitle = {Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR)}
  year = {2022},
}

Full Surround Monodepth from Multiple Cameras (RA-L + ICRA 2022)

Vitor Guizilini, Igor Vasiljevic, Rares Ambrus, Greg Shakhnarovich, Adrien Gaidon

Abstract: Self-supervised monocular depth and ego-motion estimation is a promising approach to replace or supplement expensive depth sensors such as LiDAR for robotics applications like autonomous driving. However, most research in this area focuses on a single monocular camera or stereo pairs that cover only a fraction of the scene around the vehicle. In this work, we extend monocular self-supervised depth and ego-motion estimation to large-baseline multi-camera rigs. Using generalized spatio-temporal contexts, pose consistency constraints, and carefully designed photometric loss masking, we learn a single network generating dense, consistent, and scale-aware point clouds that cover the same full surround 360 degree field of view as a typical LiDAR scanner. We also propose a new scale-consistent evaluation metric more suitable to multi-camera settings. Experiments on two challenging benchmarks illustrate the benefits of our approach over strong baselines.

Camera Abs.Rel. Sq.Rel. RMSE RMSElog SILog d1.25 d1.252 d1.253
FSM | Self-Supervised | 384x640 | ImageNet β†’ DDAD
Front 0.131 2.940 14.252 0.237 22.226 0.824 0.935 0.969
Front Right 0.205 3.349 13.677 0.353 30.777 0.667 0.852 0.922
Back Right 0.243 3.493 12.266 0.394 33.842 0.594 0.821 0.907
Back 0.194 3.743 16.436 0.348 29.901 0.669 0.850 0.926
Back Left 0.235 3.641 13.570 0.387 31.765 0.594 0.816 0.907
Front Left 0.226 3.861 12.957 0.378 32.795 0.652 0.836 0.909
@inproceedings{tri-fsm,
  title = {Full Surround Monodepth from Multiple Cameras},
  author = {Vitor Guizilini, Igor Vasiljevic, Rares Ambrus, Greg Shakhnarovich, Adrien Gaidon},
  booktitle = {Robotics and Automation Letters (RA-L)}
  year = {2022},
}

Self-Supervised Camera Self-Calibration from Videos (ICRA 2022)

Jiading Fang, Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Greg Shakhnarovich, Adrien Gaidon, Matthew R.Walter

Abstract: Camera calibration is integral to robotics and computer vision algorithms that seek to infer geometric properties of the scene from visual input streams. In practice, calibration is a laborious procedure requiring specialized data collection and careful tuning. This process must be repeated whenever the parameters of the camera change, which can be a frequent occurrence for mobile robots and autonomous vehicles. In contrast, self-supervised depth and ego-motion estimation approaches can bypass explicit calibration by inferring per-frame projection models that optimize a view synthesis objective. In this paper, we extend this approach to explicitly calibrate a wide range of cameras from raw videos in the wild. We propose a learning algorithm to regress per-sequence calibration parameters using an efficient family of general camera models. Our procedure achieves self-calibration results with sub-pixel reprojection error, outperforming other learning-based methods. We validate our approach on a wide variety of camera geometries, including perspective, fisheye, and catadioptric. Finally, we show that our approach leads to improvements in the downstream task of depth estimation, achieving state-of-the-art results on the EuRoC dataset with greater computational efficiency than contemporary methods.

@inproceedings{tri-self_calibration,
  title = {Self-Supervised Camera Self-Calibration from Video},
  author = {Jiading Fang, Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Greg Shakhnarovich, Adrien Gaidon, Matthew Walter},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)}
  year = {2022},
}

Depth Field Networks for Generalizable Multi-view Scene Representation (ECCV 2022)

Vitor Guizilini, Igor Vasiljevic, Jiading Fang, Rares Ambrus, Greg Shakhnarovich, Matthew Walter, Adrien Gaidon

Abstract: Modern 3D computer vision leverages learning to boost geometric reasoning, mapping image data to classical structures such as cost volumes or epipolar constraints to improve matching. These architectures are specialized according to the particular problem, and thus require significant task-specific tuning, often leading to poor domain generalization performance. Recently, generalist Transformer architectures have achieved impressive results in tasks such as optical flow and depth estimation by encoding geometric priors as inputs rather than as enforced constraints. In this paper, we extend this idea and propose to learn an implicit, multi-view consistent scene representation, introducing a series of 3D data augmentation techniques as a geometric inductive prior to increase view diversity. We also show that introducing view synthesis as an auxiliary task further improves depth estimation. Our Depth Field Networks (DeFiNe) achieve state-of-the-art results in stereo and video depth estimation without explicit geometric constraints, and improve on zero-shot domain generalization by a wide margin.

@inproceedings{tri-define,
  title={Depth Field Networks For Generalizable Multi-view Scene Representation},
  author={Guizilini, Vitor and Vasiljevic, Igor and Fang, Jiading and Ambru, Rare and Shakhnarovich, Greg and Walter, Matthew R and Gaidon, Adrien},
  booktitle={Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXXII},
  pages={245--262},
  year={2022},
  organization={Springer}
}

License

This repository is released under the CC BY-NC 4.0 license.

More Repositories

1

packnet-sfm

TRI-ML Monocular Depth Estimation Repository
Python
1,243
star
2

DDAD

Dense Depth for Autonomous Driving (DDAD) dataset.
Python
490
star
3

dd3d

Official PyTorch implementation of DD3D: Is Pseudo-Lidar needed for Monocular 3D Object detection? (ICCV 2021), Dennis Park*, Rares Ambrus*, Vitor Guizilini, Jie Li, and Adrien Gaidon.
Python
464
star
4

prismatic-vlms

A flexible and efficient codebase for training visually-conditioned language models (VLMs)
Python
445
star
5

KP3D

Code for "Self-Supervised 3D Keypoint Learning for Ego-motion Estimation"
Python
240
star
6

PF-Track

Implementation of PF-Track
Python
203
star
7

KP2D

Python
176
star
8

sdflabel

Official PyTorch implementation of CVPR 2020 oral "Autolabeling 3D Objects With Differentiable Rendering of SDF Shape Priors"
Python
161
star
9

realtime_panoptic

Official PyTorch implementation of CVPR 2020 Oral: Real-Time Panoptic Segmentation from Dense Detections
Python
115
star
10

permatrack

Implementation for Learning to Track with Object Permanence
Python
112
star
11

camviz

Visualization Library
Python
101
star
12

dgp

ML Dataset Governance Policy for Autonomous Vehicle Datasets
Python
94
star
13

VEDet

Python
39
star
14

RAP

This is the official code for the paper RAP: Risk-Aware Prediction for Robust Planning: https://arxiv.org/abs/2210.01368
Python
34
star
15

VOST

Code for the VOST dataset
Python
22
star
16

RAM

Implementation for Object Permanence Emerges in a Random Walk along Memory
Python
18
star
17

road

ROAD: Learning an Implicit Recursive Octree Auto-Decoder to Efficiently Encode 3D Shapes (CoRL 2022)
Python
11
star
18

efm_datasets

TRI-ML Embodied Foundation Datasets
Python
8
star
19

OctMAE

Zero-Shot Multi-Object Shape Completion (ECCV 2024)
Python
5
star
20

refine

Official PyTorch implementation of the SIGGRAPH 2024 paper "ReFiNe: Recursive Field Networks for Cross-Modal Multi-Scene Representation"
Python
5
star
21

stochastic_verification

Official repository for the paper "How Generalizable Is My Behavior Cloning Policy? A Statistical Approach to Trustworthy Performance Evaluation"
Python
5
star
22

HAICU

4
star
23

binomial_cis

Computation of binomial confidence intervals that achieve exact coverage.
Jupyter Notebook
4
star
24

vlm-evaluation

VLM Evaluation: Benchmark for VLMs, spanning text generation tasks from VQA to Captioning
Python
1
star