TRI-VIDAR: TRI's Depth Estimation Repository

Official PyTorch repository for TRI's latest published depth estimation works. Our goal is to provide a clean environment to reproduce our results and facilitate further research in this field. This repository is an updated version of PackNet-SfM, our previous monocular depth estimation repository, featuring a different license.

Models

(Experimental) For convenient inference, we provide a growing list of our models (PackNet, DeFiNe) through PyTorch Hub (torch.hub), with no installation required.

PackNet

PackNet is a self-supervised monocular depth estimation model. To load a model trained on KITTI and run inference on an RGB image:

import torch
packnet_model = torch.hub.load("TRI-ML/vidar", "PackNet", pretrained=True, trust_repo=True)
rgb_image = ...  # placeholder: replace with a 1x3xHxW torch.Tensor
depth_pred = packnet_model(rgb_image)  # predicted depth for the input image

DeFiNe

DeFiNe is a multi-view depth estimation model. To load a model trained on ScanNet and run inference on multiple posed RGB images:

import torch
define_model = torch.hub.load("TRI-ML/vidar", "DeFiNe", pretrained=True, trust_repo=True)
frames = {}
frames["rgb"] = ...         # placeholder: a list of frames as 1x3xHxW torch.Tensors
frames["intrinsics"] = ...  # placeholder: a list of 1x3x3 torch.Tensor intrinsics matrices (one for each image)
frames["pose"] = ...        # placeholder: a batch of 1x4x4 relative poses to the reference frame (one will be identity)
depth_preds = define_model(frames)  # list of depths, one for each frame
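
The pose entry holds poses relative to a reference frame, which is why the reference frame's own pose is the identity. Below is a minimal sketch of that computation in plain Python. It assumes each absolute pose is a 4x4 rigid transform and that the relative pose is inv(T_ref) @ T_i; this convention and the helper names are assumptions for illustration, not taken from the repository, so check them against what the model expects.

```python
def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Invert a 4x4 rigid transform [R | t] using R^T and -R^T t."""
    rot_t = [[t[j][i] for j in range(3)] for i in range(3)]  # R transposed
    trans = [-sum(rot_t[i][k] * t[k][3] for k in range(3)) for i in range(3)]
    return [rot_t[i] + [trans[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def relative_poses(poses, ref_index=0):
    """Express every pose relative to poses[ref_index] (assumed convention)."""
    ref_inv = invert_rigid(poses[ref_index])
    return [matmul4(ref_inv, pose) for pose in poses]


# Two hypothetical camera poses: the first is shifted 0.5 units along x.
identity = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
shifted = [[1.0, 0.0, 0.0, 0.5], [0.0, 1.0, 0.0, 0.0],
           [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
rel = relative_poses([shifted, identity], ref_index=0)
```

Here rel[0] is the identity because the first pose is the reference. Each resulting matrix can be converted with torch.tensor(matrix).unsqueeze(0) to obtain a 1x4x4 tensor.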

Installation

We recommend using our provided Dockerfile (see nvidia-docker2 instructions) to have a reproducible environment. To set up the repository, type in a terminal (only tested on Ubuntu 18.04):

git clone --recurse-submodules https://github.com/TRI-ML/vidar.git   # Clone repository with submodules
cd vidar                                                             # Move to repository folder
make docker-build                                                    # Build the docker image (recommended)

To start our docker container, simply type make docker-interactive. From inside the docker, you can run scripts with the following command pattern:

python3 scripts/run.py <config.yaml>         # Single CPU/GPU  
python3 scripts/run_ddp.py <config.yaml>     # Distributed Data Parallel (DDP) multi-GPU

To verify that the environment is set up correctly, you can run a simple overfit test:

# Download a tiny subset of KITTI
mkdir -p /data/vidar
curl -s https://tri-ml-public.s3.amazonaws.com/github/vidar/datasets/KITTI_tiny.tar | tar xv -C /data/vidar/
# Inside docker
python3 scripts/run.py configs/overfit/kitti_tiny.yaml

Once training is over (which takes around 1 minute), the run reports depth evaluation results for this tiny subset.

If you want to use features related to AWS (for dataset access) and WandB (for experiment management), you can create associated accounts and configure your shell with the following environment variables:

export AWS_SECRET_ACCESS_KEY=something    # AWS secret key
export AWS_ACCESS_KEY_ID=something        # AWS access key
export AWS_DEFAULT_REGION=something       # AWS default region
export WANDB_ENTITY=something             # WANDB entity
export WANDB_API_KEY=something            # WANDB API key
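
Before launching a run that depends on these services, it can help to confirm that the variables are actually set. The following is a small sketch using only the Python standard library; the helper name missing_vars is hypothetical and is not part of the repository.

```python
import os

# Variable names listed above; which ones are needed depends on the features used.
AWS_VARS = ("AWS_SECRET_ACCESS_KEY", "AWS_ACCESS_KEY_ID", "AWS_DEFAULT_REGION")
WANDB_VARS = ("WANDB_ENTITY", "WANDB_API_KEY")


def missing_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


if __name__ == "__main__":
    for label, names in (("AWS", AWS_VARS), ("WandB", WANDB_VARS)):
        missing = missing_vars(names)
        status = "ok" if not missing else "missing " + ", ".join(missing)
        print(f"{label}: {status}")
```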

Configuration

Configuration files (stored in the configs folder) are the entry points for training and inference. The basic structure of a configuration file is:

wrapper:                            # Training parameters 
    <parameters>
arch:                               # Architecture used
    model:                          # Model file and parameters
        file: <model_file>
        <parameters>
    networks:                       # Networks available to the model
        network1:                   # Network1 file and parameters 
            file: <network1_file>
            <parameters>
        network2:                   # Network2 file and parameters 
            file: <network2_file>
            <parameters>
        ...
    losses:                         # Losses available to the model 
        loss1:                      # Loss1 file and parameters
            file: <loss1_file>      
            <parameters>
        loss2:                      # Loss2 file and parameters
            file: <loss2_file>      
            <parameters>
        ...        
evaluation:                         # Evaluation metrics for different tasks
    evaluation1:                    # Evaluation1 and parameters
        <parameters>
    evaluation2:                    # Evaluation2 and parameters
        <parameters>
    ...
optimizers:                         # Optimizers used to train the networks
    network1:                       # Optimizer for network1 and parameters
        <parameters>
    network2:                       # Optimizer for network2 and parameters
        <parameters>
    ...
datasets:                           # Datasets used
    train:                          # Training dataset and parameters
        <parameters>                
        augmentation:               # Training augmentations and parameters 
            <parameters>
        dataloader:                 # Training dataloader and parameters
            <parameters>
    validation:                     # Validation dataset and parameters
        <parameters>                
        augmentation:               # Validation augmentations and parameters
            <parameters>
        dataloader:                 # Validation dataloader and parameters
            <parameters>
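
As a quick sanity check on a configuration, one could verify that the top-level sections from the structure above are present once the file has been parsed. This is only a sketch: the parsed configuration is represented as a plain nested dict, the helper name is hypothetical, and treating every section as required is an assumption (an inference-only configuration may not need all of them).

```python
# Top-level sections from the structure above; treating all of them as
# required is an assumption made for this sketch.
EXPECTED_SECTIONS = ("wrapper", "arch", "evaluation", "optimizers", "datasets")


def missing_sections(config):
    """Return the expected top-level sections absent from a parsed config."""
    return [name for name in EXPECTED_SECTIONS if name not in config]


# A parsed configuration is represented here as a plain nested dict.
config = {
    "wrapper": {},
    "arch": {"model": {}, "networks": {}, "losses": {}},
    "datasets": {"train": {}, "validation": {}},
}
print(missing_sections(config))
```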

To enable WandB logging, you can set these additional parameters in your configuration file:

wandb:
    folder: /data/vidar/wandb     # Where the wandb run is stored
    entity: your_entity           # Wandb entity
    project: your_project         # Wandb project
    num_validation_logs: X        # Number of visualization logs
    tags: [tag1,tag2,...]         # Wandb tags
    notes: note                   # Wandb notes

To enable checkpoint saving, you can set these additional parameters in your configuration file:

checkpoint:
    folder: /data/vidar/checkpoints       # Local folder to store checkpoints
    save_code: True                       # Save repository folder as well
    keep_top: 5                           # How many checkpoints should be stored
    s3_bucket: s3://path/to/s3/bucket     # [optional] AWS folder to store checkpoints        
    dataset: [0]                          # [optional] Validation dataset index to track
    monitor: [depth|abs_rel_pp_gt(0)_0]   # [optional] Validation metric to track
    mode: [min]                           # [optional] If the metric is minimized (min) or maximized (max)
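
To illustrate how keep_top, monitor, and mode could interact, here is a minimal sketch that ranks checkpoints by a tracked metric and keeps only the best few. It assumes that keep_top retains the best checkpoints according to the monitored metric; the function name and the metric values are hypothetical, and the repository's actual bookkeeping may differ.

```python
def update_top_checkpoints(kept, name, value, keep_top=5, mode="min"):
    """Add (name, value) and keep only the best `keep_top` entries.

    Returns the updated list (best first) and the entries that were dropped.
    """
    if mode not in ("min", "max"):
        raise ValueError("mode must be 'min' or 'max'")
    ranked = sorted(kept + [(name, value)], key=lambda item: item[1],
                    reverse=(mode == "max"))
    return ranked[:keep_top], ranked[keep_top:]


kept = []
# Hypothetical abs_rel values from four validation runs (lower is better).
for epoch, abs_rel in enumerate([0.120, 0.095, 0.101, 0.088]):
    kept, dropped = update_top_checkpoints(
        kept, f"epoch_{epoch}.ckpt", abs_rel, keep_top=2, mode="min")
```

With keep_top=2 and mode="min", the two checkpoints with the lowest metric survive, and the entries returned in dropped are the ones a caller would delete from disk.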

To make configuration files easier to reuse, we also provide a recipe mechanism that enables parameter sharing. To use a recipe, add recipe: <path/to/recipe>|<entry> as an additional parameter in a section; all entries from that recipe are then copied into that section. For example:

wrapper: 
  recipe: wrapper|default

will insert all parameters from section default of configs/recipes/wrapper.yaml onto the wrapper section of the configuration file. Parameters added after the recipe will overwrite those copied over, to facilitate customization.
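
The override behavior can be pictured as a dictionary merge. The sketch below is an illustration only: recipes are passed in as an already-parsed dict standing in for the files under configs/recipes/, the parameter names are hypothetical, and the merge is shallow and ignores key order, which may differ from how the repository resolves recipes.

```python
def apply_recipe(section, recipes):
    """Expand a `recipe: <file>|<entry>` key inside one config section.

    `recipes` maps a recipe file name to its parsed entries (a stand-in for
    loading YAML files from configs/recipes/ in this sketch).
    """
    if "recipe" not in section:
        return dict(section)
    file_name, entry = section["recipe"].split("|")
    merged = dict(recipes[file_name][entry])       # copy the recipe entries
    merged.update({key: value for key, value in section.items()
                   if key != "recipe"})            # local parameters win
    return merged


# Hypothetical recipe content and parameter names, for illustration only.
recipes = {"wrapper": {"default": {"max_epochs": 50, "seed": 42}}}
section = {"recipe": "wrapper|default", "max_epochs": 10}
print(apply_recipe(section, recipes))
```

In this example the locally set max_epochs replaces the recipe's value, while seed is inherited unchanged.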

Datasets

In our provided configuration files, datasets are assumed to be downloaded in /data/vidar/<dataset-name>. For convenience, we provide links to some datasets we commonly use here (all licences still apply):

Dataset	Version	Labels	Splits
KITTI	KITTI_raw	RGB, Depth, Poses, Intrinsics	Train / Validation / Test
KITTI	KITTI_tiny	RGB, Depth, Poses, Intrinsics	Train
DDAD	DDAD_trainval	Depth prediction	Train / Validation
DDAD	DDAD_tiny	Depth estimation	Train
DDAD	DDAD_test	Depth estimation	Test
PD	PD_guda	Depth prediction	Train / Validation
PD	PD_draft	Depth estimation	Train / Validation
VKITTI2	VKITTI2	Full Virtual KITTI 2 dataset	Train
VKITTI2	VKITTI2_tiny	Tiny version of VKITTI2	Train
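
To see which datasets are already in place, a short standard-library script can scan the expected folder. This sketch assumes that each dataset is extracted into a folder named after its Version entry in the table above; that naming, and the helper name find_datasets, are assumptions rather than something the repository guarantees.

```python
from pathlib import Path

# Folder names taken from the Version column of the table above.
DATASET_VERSIONS = ("KITTI_raw", "KITTI_tiny", "DDAD_trainval", "DDAD_tiny",
                    "DDAD_test", "PD_guda", "PD_draft", "VKITTI2", "VKITTI2_tiny")


def find_datasets(root="/data/vidar"):
    """Return a {version: available} map for folders under `root`."""
    root = Path(root)
    return {name: (root / name).is_dir() for name in DATASET_VERSIONS}


if __name__ == "__main__":
    for name, available in find_datasets().items():
        print(f"{name:15s} {'found' if available else 'missing'}")
```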

Visualization

We also provide tools for dataset and prediction visualization, based on our CamViz library. It is added as a submodule in the externals folder. To use it from inside the docker, run xhost +local: before entering it. To visualize the information contained in different datasets, after it has been processed to be used by our repository, use the following command:

python3 demos/display_datasets/display_datasets.py <dataset>

This produces interactive visualizations for datasets such as KITTI and DDAD; more examples can be found in the demo configuration file demos/display_datasets/config.yaml.

You can move the virtual viewing camera with the mouse, holding the left button to translate, the right button to rotate, and scrolling the wheel to zoom in/out. The up/down arrow keys change between temporal contexts, and the left/right arrow keys change between labels. Pressing SPACE changes between pointcloud color schemes (pixel color or per-camera).

Publications

3D Packing for Self-Supervised Monocular Depth Estimation (CVPR 2020, oral)

Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, Adrien Gaidon

Abstract: Although cameras are ubiquitous, robotic platforms typically rely on active sensors like LiDAR for direct 3D perception. In this work, we propose a novel self-supervised monocular depth estimation method combining geometry with a new deep network, PackNet, learned only from unlabeled monocular videos. Our architecture leverages novel symmetrical packing and unpacking blocks to jointly learn to compress and decompress detail-preserving representations using 3D convolutions. Although self-supervised, our method outperforms other self, semi, and fully supervised methods on the KITTI benchmark. The 3D inductive bias in PackNet enables it to scale with input resolution and number of parameters without overfitting, generalizing better on out-of-domain data such as the NuScenes dataset. Furthermore, it does not require large-scale supervised pretraining on ImageNet and can run in real-time. Finally, we release DDAD (Dense Depth for Automated Driving), a new urban driving dataset with more challenging and accurate depth evaluation, thanks to longer-range and denser ground-truth depth generated from high-density LiDARs mounted on a fleet of self-driving cars operating world-wide.

GT depth	Abs.Rel.	Sq.Rel.	RMSE	RMSElog	SILog	d_1.25	d_1.25²	d_1.25³
ResNet18 | Self-Supervised | 192x640 | ImageNet → KITTI
Original	0.116	0.811	4.902	0.198	19.259	0.865	0.957	0.981
Improved	0.087	0.471	3.947	0.135	12.879	0.913	0.983	0.996
PackNet | Self-Supervised | 192x640 | KITTI
Original	0.111	0.800	4.576	0.189	18.504	0.880	0.960	0.982
Improved	0.078	0.420	3.485	0.121	11.725	0.931	0.986	0.996
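
For readers unfamiliar with the column headers, the sketch below computes most of them as they are commonly defined in the depth estimation literature (SILog is omitted). It works on plain lists of paired positive depths and does not reproduce the repository's evaluation protocol, which may additionally apply valid-pixel masks, depth caps, or scaling; the sample values are hypothetical.

```python
import math


def depth_metrics(gt, pred):
    """Compute standard depth metrics over paired positive depth values."""
    n = len(gt)
    abs_rel = sum(abs(g - p) / g for g, p in zip(gt, pred)) / n
    sq_rel = sum((g - p) ** 2 / g for g, p in zip(gt, pred)) / n
    rmse = math.sqrt(sum((g - p) ** 2 for g, p in zip(gt, pred)) / n)
    rmse_log = math.sqrt(
        sum((math.log(g) - math.log(p)) ** 2 for g, p in zip(gt, pred)) / n)
    # Threshold accuracy: fraction of pixels with max(gt/pred, pred/gt) < 1.25^k.
    ratios = [max(g / p, p / g) for g, p in zip(gt, pred)]
    deltas = {f"d_1.25^{k}": sum(r < 1.25 ** k for r in ratios) / n
              for k in (1, 2, 3)}
    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse,
            "rmse_log": rmse_log, **deltas}


# Hypothetical ground-truth and predicted depths in meters.
gt = [10.0, 20.0, 40.0]
pred = [11.0, 18.0, 60.0]
metrics = depth_metrics(gt, pred)
```

Lower is better for the four error metrics, while the d_1.25 thresholds are accuracies where higher is better.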

@inproceedings{tri-packnet,
  title = {3D Packing for Self-Supervised Monocular Depth Estimation},
  author = {Vitor Guizilini and Rares Ambrus and Sudeep Pillai and Allan Raventos and Adrien Gaidon},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2020},
}

Multi-Frame Self-Supervised Depth with Transformers (CVPR 2022)

Vitor Guizilini, Rares Ambrus, Dian Chen, Sergey Zakharov, Adrien Gaidon

Abstract: Multi-frame depth estimation improves over single-frame approaches by also leveraging geometric relationships between images via feature matching, in addition to learning appearance-based features. In this paper we revisit feature matching for self-supervised monocular depth estimation, and propose a novel transformer architecture for cost volume generation. We use depth-discretized epipolar sampling to select matching candidates, and refine predictions through a series of self- and cross-attention layers. These layers sharpen the matching probability between pixel features, improving over standard similarity metrics prone to ambiguities and local minima. The refined cost volume is decoded into depth estimates, and the whole pipeline is trained end-to-end from videos using only a photometric objective. Experiments on the KITTI and DDAD datasets show that our DepthFormer architecture establishes a new state of the art in self-supervised monocular depth estimation, and is even competitive with highly specialized supervised single-frame architectures. We also show that our learned cross-attention network yields representations transferable across datasets, increasing the effectiveness of pre-training strategies.

GT depth	Frames	Abs.Rel.	Sq.Rel.	RMSE	RMSElog	SILog	d_1.25	d_1.25²	d_1.25³
DepthFormer | Self-Supervised | 192x640 | ImageNet → KITTI
Original	Single (t)	0.117	0.876	4.692	0.193	18.940	0.874	0.960	0.981
Original	Multi (t-1,t)	0.090	0.661	4.149	0.175	17.260	0.905	0.963	0.982
Improved	Single (t)	0.083	0.464	3.591	0.126	12.156	0.926	0.986	0.996
Improved	Multi (t-1,t)	0.055	0.271	2.917	0.095	9.160	0.955	0.991	0.998

@inproceedings{tri-depthformer,
  title = {Multi-Frame Self-Supervised Depth with Transformers},
  author = {Vitor Guizilini and Rares Ambrus and Dian Chen and Sergey Zakharov and Adrien Gaidon},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year = {2022},
}

Full Surround Monodepth from Multiple Cameras (RA-L + ICRA 2022)

Vitor Guizilini, Igor Vasiljevic, Rares Ambrus, Greg Shakhnarovich, Adrien Gaidon

Abstract: Self-supervised monocular depth and ego-motion estimation is a promising approach to replace or supplement expensive depth sensors such as LiDAR for robotics applications like autonomous driving. However, most research in this area focuses on a single monocular camera or stereo pairs that cover only a fraction of the scene around the vehicle. In this work, we extend monocular self-supervised depth and ego-motion estimation to large-baseline multi-camera rigs. Using generalized spatio-temporal contexts, pose consistency constraints, and carefully designed photometric loss masking, we learn a single network generating dense, consistent, and scale-aware point clouds that cover the same full surround 360 degree field of view as a typical LiDAR scanner. We also propose a new scale-consistent evaluation metric more suitable to multi-camera settings. Experiments on two challenging benchmarks illustrate the benefits of our approach over strong baselines.

Camera	Abs.Rel.	Sq.Rel.	RMSE	RMSElog	SILog	d_1.25	d_1.25²	d_1.25³
FSM | Self-Supervised | 384x640 | ImageNet → DDAD
Front	0.131	2.940	14.252	0.237	22.226	0.824	0.935	0.969
Front Right	0.205	3.349	13.677	0.353	30.777	0.667	0.852	0.922
Back Right	0.243	3.493	12.266	0.394	33.842	0.594	0.821	0.907
Back	0.194	3.743	16.436	0.348	29.901	0.669	0.850	0.926
Back Left	0.235	3.641	13.570	0.387	31.765	0.594	0.816	0.907
Front Left	0.226	3.861	12.957	0.378	32.795	0.652	0.836	0.909

@inproceedings{tri-fsm,
  title = {Full Surround Monodepth from Multiple Cameras},
  author = {Vitor Guizilini and Igor Vasiljevic and Rares Ambrus and Greg Shakhnarovich and Adrien Gaidon},
  booktitle = {Robotics and Automation Letters (RA-L)},
  year = {2022},
}

Self-Supervised Camera Self-Calibration from Video (ICRA 2022)

Jiading Fang, Igor Vasiljevic, Vitor Guizilini, Rares Ambrus, Greg Shakhnarovich, Adrien Gaidon, Matthew R. Walter

Abstract: Camera calibration is integral to robotics and computer vision algorithms that seek to infer geometric properties of the scene from visual input streams. In practice, calibration is a laborious procedure requiring specialized data collection and careful tuning. This process must be repeated whenever the parameters of the camera change, which can be a frequent occurrence for mobile robots and autonomous vehicles. In contrast, self-supervised depth and ego-motion estimation approaches can bypass explicit calibration by inferring per-frame projection models that optimize a view synthesis objective. In this paper, we extend this approach to explicitly calibrate a wide range of cameras from raw videos in the wild. We propose a learning algorithm to regress per-sequence calibration parameters using an efficient family of general camera models. Our procedure achieves self-calibration results with sub-pixel reprojection error, outperforming other learning-based methods. We validate our approach on a wide variety of camera geometries, including perspective, fisheye, and catadioptric. Finally, we show that our approach leads to improvements in the downstream task of depth estimation, achieving state-of-the-art results on the EuRoC dataset with greater computational efficiency than contemporary methods.

@inproceedings{tri-self_calibration,
  title = {Self-Supervised Camera Self-Calibration from Video},
  author = {Jiading Fang and Igor Vasiljevic and Vitor Guizilini and Rares Ambrus and Greg Shakhnarovich and Adrien Gaidon and Matthew R. Walter},
  booktitle = {IEEE International Conference on Robotics and Automation (ICRA)},
  year = {2022},
}

Depth Field Networks for Generalizable Multi-view Scene Representation (ECCV 2022)

Vitor Guizilini, Igor Vasiljevic, Jiading Fang, Rares Ambrus, Greg Shakhnarovich, Matthew Walter, Adrien Gaidon

Abstract: Modern 3D computer vision leverages learning to boost geometric reasoning, mapping image data to classical structures such as cost volumes or epipolar constraints to improve matching. These architectures are specialized according to the particular problem, and thus require significant task-specific tuning, often leading to poor domain generalization performance. Recently, generalist Transformer architectures have achieved impressive results in tasks such as optical flow and depth estimation by encoding geometric priors as inputs rather than as enforced constraints. In this paper, we extend this idea and propose to learn an implicit, multi-view consistent scene representation, introducing a series of 3D data augmentation techniques as a geometric inductive prior to increase view diversity. We also show that introducing view synthesis as an auxiliary task further improves depth estimation. Our Depth Field Networks (DeFiNe) achieve state-of-the-art results in stereo and video depth estimation without explicit geometric constraints, and improve on zero-shot domain generalization by a wide margin.

@inproceedings{tri-define,
  title={Depth Field Networks For Generalizable Multi-view Scene Representation},
  author={Guizilini, Vitor and Vasiljevic, Igor and Fang, Jiading and Ambrus, Rares and Shakhnarovich, Greg and Walter, Matthew R and Gaidon, Adrien},
  booktitle={Computer Vision--ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23--27, 2022, Proceedings, Part XXXII},
  pages={245--262},
  year={2022},
  organization={Springer}
}

License

This repository is released under the CC BY-NC 4.0 license.
