UniTR: The First Unified Multi-modal Transformer Backbone for 3D Perception
This repo is the official implementation of the ICCV 2023 paper UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation, as well as its follow-ups. UniTR achieves state-of-the-art performance on the nuScenes dataset with a truly unified, weight-sharing multi-modal (e.g., cameras and LiDARs) backbone. UniTR is built upon the codebase of DSVT. We have made every effort to keep the codebase clean, concise, easily readable, state-of-the-art, and reliant on only minimal dependencies.
UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation
Haiyang Wang*, Hao Tang*, Shaoshuai Shi$^\dagger$, Aoxue Li, Zhenguo Li, Bernt Schiele, Liwei Wang$^\dagger$
Contact: Haiyang Wang ([email protected]), Hao Tang ([email protected]), Shaoshuai Shi ([email protected])
Gratitude to Tang Hao for extensive code refactoring and noteworthy contributions to open-source initiatives. His invaluable efforts were pivotal in ensuring the seamless completion of UniTR.
Honestly, the partition step in UniTR is slow and takes about 40% of the total time, but this cost can be eliminated with better strategies or some engineering effort, so there is still huge room for speed optimization. We are not HPC experts, but if anyone in the industry wants to improve this, we believe it could be halved. Importantly, this part does not scale with model size, making it friendly for larger models.
I am going to share my understanding of, and future plans for, a general 3D perception foundation model without reservation. Please refer to Potential Research. If you find it useful or inspiring for your research, feel free to join me in building this blueprint.
Interpretive Articles: [CVer] [自动驾驶之心 (Heart of Autonomous Driving)] [ReadPaper] [知乎 (Zhihu)] [CSDN] [TechBeat (将门创投)]
News
- [23-09-21] Code of NuScenes is released.
- [23-08-16] SOTA: Our single multi-modal UniTR outshines all other non-TTA approaches on the nuScenes Detection benchmark (Aug 2023), with an NDS of 74.5.
- [23-08-16] SOTA performance of multi-modal 3D object detection and BEV Map Segmentation on the NuScenes validation set.
- [23-08-15] UniTR is released on arXiv.
- [23-07-13] UniTR is accepted at ICCV 2023.
Overview
- Todo
- Introduction
- Main Results
- Quick Start
- Citation
- Acknowledgments
TODO
- Release the arXiv version.
- SOTA performance of multi-modal 3D object detection (NuScenes) and BEV Map Segmentation (NuScenes).
- Clean up and release the code of NuScenes.
- Merge UniTR to OpenPCDet.
Introduction
Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, leading to additional computational overhead and inefficient collaboration between different sensor data.
In this paper, we present an efficient multi-modal backbone for outdoor 3D perception, which processes a variety of modalities with unified modeling and shared parameters. It is a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state of the art on the nuScenes benchmark, achieving +1.1 NDS for 3D object detection and +12.0 mIoU for BEV map segmentation, with lower inference latency.
Main results
3D Object Detection (on NuScenes validation)
Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE | ckpt | Log |
---|---|---|---|---|---|---|---|---|---|
UniTR | 73.0 | 70.1 | 26.3 | 24.7 | 26.8 | 24.6 | 17.9 | ckpt | Log |
UniTR+LSS | 73.3 | 70.5 | 26.0 | 24.4 | 26.8 | 24.8 | 18.7 | ckpt | Log |
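As a sanity check on the table above, the NDS column can be reproduced from the other columns with the nuScenes detection score definition, NDS = (1/10) [5 mAP + sum over the five TP metrics of (1 - min(1, mTP))]. The sketch below assumes every metric in the table is reported multiplied by 100 (so an mATE of 26.3 stands for 0.263). The function name `nds` is ours for illustration; it is not part of the UniTR codebase or nuscenes-devkit.

```python
def nds(map_pct, tp_errors_pct):
    """nuScenes detection score from mAP and the five mean TP errors.

    Assumption: all inputs use the x100 scale of the tables above
    (e.g. mATE 26.3 stands for 0.263).
    """
    m_ap = map_pct / 100.0
    # Each true-positive error is clipped to 1 and converted into a score.
    tp_scores = [1.0 - min(1.0, err / 100.0) for err in tp_errors_pct]
    return 100.0 * (5.0 * m_ap + sum(tp_scores)) / 10.0


# UniTR on the validation set: mATE, mASE, mAOE, mAVE, mAAE
unitr_val = nds(70.1, [26.3, 24.7, 26.8, 24.6, 17.9])
print(round(unitr_val, 1))  # 73.0, matching the NDS column
```

Because the table inputs are already rounded, recomputed values for other rows can differ from the reported NDS by roughly a tenth of a point.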
3D Object Detection (on NuScenes test)
Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
---|---|---|---|---|---|---|---|
UniTR | 74.1 | 70.5 | 24.4 | 23.3 | 25.7 | 24.1 | 13.0 |
UniTR+LSS | 74.5 | 70.9 | 24.1 | 22.9 | 25.6 | 24.0 | 13.1 |
BEV Map Segmentation (on NuScenes validation)
Model | mIoU | Drivable | Ped.Cross. | Walkway | StopLine | Carpark | Divider | ckpt | Log |
---|---|---|---|---|---|---|---|---|---|
UniTR | 73.2 | 90.4 | 73.1 | 78.2 | 66.6 | 67.3 | 63.8 | ckpt | Log |
UniTR+LSS | 74.7 | 90.7 | 74.0 | 79.3 | 68.2 | 72.9 | 64.2 | ckpt | Log |
What's new here?
Beats previous SOTAs of outdoor multi-modal 3D Object Detection and BEV Segmentation
Our approach has achieved the best performance on multiple tasks (e.g., 3D Object Detection and BEV Map Segmentation), and it is highly versatile, requiring only the replacement of the backbone.
3D Object Detection
BEV Map Segmentation
Weight-Sharing among all modalities
We introduce a modality-agnostic transformer encoder to handle these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps.
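To make the weight-sharing idea concrete, here is a toy, dependency-free sketch. It is not the UniTR implementation: it only illustrates that one set of attention weights can first process each modality's tokens separately and then process the concatenated sequence, so cross-modal interaction needs no extra fusion module. The class `SharedAttention`, the token values, and the feature size are all hypothetical.

```python
import math
import random


def matvec(weight, vector):
    """Multiply a square weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class SharedAttention:
    """Single-head self-attention whose weights are reused for every modality."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def init():
            return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]

        self.wq, self.wk, self.wv = init(), init(), init()
        self.scale = 1.0 / math.sqrt(dim)

    def __call__(self, tokens):
        queries = [matvec(self.wq, t) for t in tokens]
        keys = [matvec(self.wk, t) for t in tokens]
        values = [matvec(self.wv, t) for t in tokens]
        outputs = []
        for q in queries:
            scores = [self.scale * sum(a * b for a, b in zip(q, k)) for k in keys]
            attn = softmax(scores)
            outputs.append(
                [sum(w * v[d] for w, v in zip(attn, values)) for d in range(len(q))]
            )
        return outputs


# Hypothetical 4-dimensional tokens from two sensors.
image_tokens = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2]]
lidar_tokens = [[0.9, 0.8, 0.1, 0.0], [0.3, 0.3, 0.3, 0.3], [0.0, 0.1, 0.7, 0.2]]

block = SharedAttention(dim=4)
intra_image = block(image_tokens)        # modal-wise pass, same weights
intra_lidar = block(lidar_tokens)        # modal-wise pass, same weights
fused = block(intra_image + intra_lidar)  # cross-modal pass, still the same weights
print(len(fused), len(fused[0]))
```

The point of the sketch is only that `block` holds a single parameter set; a real backbone would stack many such layers and restrict attention to partitioned token sets.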
Prerequisite for 3D vision foundation models
A weight-shared unified multimodal encoder is a prerequisite for foundation models, especially in the context of 3D perception, unifying information from both images and LiDAR data. This is the first truly multimodal fusion backbone, seamlessly connecting to any 3D detection head.
Quick Start
Installation
conda create -n unitr python=3.8
# Install torch, we only test it in pytorch 1.10
pip install torch==1.10.1+cu113 torchvision==0.11.2+cu113 -f https://download.pytorch.org/whl/torch_stable.html
git clone https://github.com/Haiyang-W/UniTR
cd UniTR
# Install extra dependency
pip install -r requirements.txt
# Install nuscenes-devkit
pip install nuscenes-devkit==1.0.5
# Develop
python setup.py develop
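Since only PyTorch 1.10 is tested, it can help to confirm that the environment matches the pinned versions before building. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper names `installed_version` and `version_mismatches` are hypothetical and not part of this repo.

```python
from importlib import metadata

# Versions pinned by the installation commands above.
EXPECTED = {"torch": "1.10.1", "torchvision": "0.11.2", "nuscenes-devkit": "1.0.5"}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_mismatches(expected, lookup=installed_version):
    """Map each package whose version differs to an (expected, found) pair."""
    problems = {}
    for package, wanted in expected.items():
        found = lookup(package)
        # Strip local build tags such as "+cu113" before comparing.
        base = found.split("+")[0] if found else None
        if base != wanted:
            problems[package] = (wanted, found)
    return problems


if __name__ == "__main__":
    for package, (wanted, found) in version_mismatches(EXPECTED).items():
        print(f"{package}: expected {wanted}, found {found}")
```

An empty output means the three pinned packages match; the `lookup` parameter exists only so the comparison can be exercised without a real installation.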
Dataset Preparation
- Please download the official NuScenes 3D object detection dataset and organize the downloaded files as follows:
OpenPCDet
├── data
│   ├── nuscenes
│   │   ├── v1.0-trainval (or v1.0-mini if you use mini)
│   │   │   ├── samples
│   │   │   ├── sweeps
│   │   │   ├── maps
│   │   │   ├── v1.0-trainval
├── pcdet
├── tools
- (optional) To install the map expansion for the BEV map segmentation task, please download the files from Map expansion (Map expansion pack (v1.3)) and copy them into your nuScenes maps folder, e.g. /data/nuscenes/v1.0-trainval/maps, as follows:
OpenPCDet
├── maps
│   ├── ......
│   ├── boston-seaport.json
│   ├── singapore-onenorth.json
│   ├── singapore-queenstown.json
│   ├── singapore-hollandvillage.json
- Generate the data infos by running the following command (it may take several hours):
# Create dataset info file, lidar and image gt database
python -m pcdet.datasets.nuscenes.nuscenes_dataset --func create_nuscenes_infos \
--cfg_file tools/cfgs/dataset_configs/nuscenes_dataset.yaml \
--version v1.0-trainval \
--with_cam \
--with_cam_gt \
# --share_memory # if use share mem for lidar and image gt sampling (about 24G+143G or 12G+72G)
# share mem will greatly improve your training speed, but need 150G or 75G extra cache mem.
# NOTE: all the experiments used share memory. Share mem will not affect performance
- The format of the generated data is as follows:
OpenPCDet
├── data
│   ├── nuscenes
│   │   ├── v1.0-trainval (or v1.0-mini if you use mini)
│   │   │   ├── samples
│   │   │   ├── sweeps
│   │   │   ├── maps
│   │   │   ├── v1.0-trainval
│   │   │   ├── img_gt_database_10sweeps_withvelo
│   │   │   ├── gt_database_10sweeps_withvelo
│   │   │   ├── nuscenes_10sweeps_withvelo_lidar.npy (optional) # if open share mem
│   │   │   ├── nuscenes_10sweeps_withvelo_img.npy (optional) # if open share mem
│   │   │   ├── nuscenes_infos_10sweeps_train.pkl
│   │   │   ├── nuscenes_infos_10sweeps_val.pkl
│   │   │   ├── nuscenes_dbinfos_10sweeps_withvelo.pkl
├── pcdet
├── tools
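Before launching training, the layout above can be verified with a few lines of standard-library Python. This is a sketch under stated assumptions: the helper `missing_entries` is hypothetical, it checks names only (not file contents), and it assumes the innermost metadata folder carries the same name as the chosen version (for example `v1.0-mini` when using the mini split). The optional shared-memory `.npy` files are deliberately not checked.

```python
from pathlib import Path

# Entries produced by the data-info generation step listed above.
GENERATED_ENTRIES = [
    "img_gt_database_10sweeps_withvelo",
    "gt_database_10sweeps_withvelo",
    "nuscenes_infos_10sweeps_train.pkl",
    "nuscenes_infos_10sweeps_val.pkl",
    "nuscenes_dbinfos_10sweeps_withvelo.pkl",
]


def missing_entries(data_root, version="v1.0-trainval", generated=True):
    """Return the expected entries that do not exist under data_root."""
    base = Path(data_root) / "nuscenes" / version
    # Raw download: sensor data, maps, and the metadata folder.
    names = ["samples", "sweeps", "maps", version]
    if generated:
        names = names + GENERATED_ENTRIES
    return [name for name in names if not (base / name).exists()]


if __name__ == "__main__":
    for name in missing_entries("data"):
        print(f"missing: {name}")
```

Run it from the repository root; passing `generated=False` checks only the raw download, which is useful before the info files have been created.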
Training
Please download the pretrained checkpoint unitr_pretrain.pth and copy it to the root folder, e.g., UniTR/unitr_pretrain.pth. This file contains the weights obtained by pretraining DSVT on the ImageNet and nuImages datasets.
3D object detection:
# multi-gpu training
## normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000
## add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --sync_bn --pretrained_model ../unitr_pretrain.pth --logger_iter_interval 1000
BEV Map Segmentation:
# multi-gpu training
# note that we don't use image pretrain in BEV Map Segmentation
## normal
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --sync_bn --eval_map --logger_iter_interval 1000
## add lss
cd tools
bash scripts/dist_train.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --sync_bn --eval_map --logger_iter_interval 1000
Testing
3D object detection:
# multi-gpu testing
## normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr.yaml --ckpt <CHECKPOINT_FILE>
## add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+lss.yaml --ckpt <CHECKPOINT_FILE>
BEV Map Segmentation:
# multi-gpu testing
## normal
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map.yaml --ckpt <CHECKPOINT_FILE> --eval_map
## add LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_map+lss.yaml --ckpt <CHECKPOINT_FILE> --eval_map
# NOTE: evaluation results will not be logged in *.log; they are only printed in the terminal
Cache Testing
- If the camera and LiDAR parameters of the dataset you are using remain constant, using our cache mode will not affect performance. You can even cache all mapping calculations during the training phase, which can significantly accelerate training.
- Each sample in NuScenes has some variation in camera parameters, so during normal inference we disable cache mode to ensure accurate results. However, thanks to the robustness of our mapping, even with camera parameter variations like those in NuScenes, performance drops only slightly (around 0.4 NDS).
- Cache mode only supports a batch size of 1 per GPU for now (8x1=8).
- In our observation, backbone caching reduces inference latency by 40%.
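The idea behind cache mode can be sketched as plain memoization: a mapping that depends only on calibration is computed once per distinct calibration and then reused. The code below is an illustration under stated assumptions, not the repo's cache implementation. `MappingCache` and `project_points` are hypothetical names, and the pinhole projection merely stands in for the real camera-LiDAR mapping.

```python
class MappingCache:
    """Memoize an expensive mapping, keyed on the calibration it depends on."""

    def __init__(self, compute_fn):
        self.compute_fn = compute_fn
        self._store = {}
        self.misses = 0  # number of times the mapping was actually computed

    def get(self, calibration):
        key = tuple(calibration)  # hashable snapshot of the parameters
        if key not in self._store:
            self.misses += 1
            self._store[key] = self.compute_fn(calibration)
        return self._store[key]


def project_points(calibration):
    """Hypothetical stand-in: pinhole projection of two fixed 3D points."""
    fx, fy, cx, cy = calibration
    points = [(1.0, 0.5, 10.0), (-2.0, 0.0, 20.0)]
    return [(fx * x / z + cx, fy * y / z + cy) for x, y, z in points]


cache = MappingCache(project_points)
fixed_calib = (1000.0, 1000.0, 800.0, 450.0)
for _ in range(3):  # three samples sharing one calibration
    mapping = cache.get(fixed_calib)
print(cache.misses)  # 1: computed once, reused twice
```

When calibration changes from sample to sample, as in NuScenes, an exact-key cache like this one would miss every time. Reusing one stored mapping anyway is what trades a small accuracy drop for lower latency.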
# Only for 3D Object Detection
## normal
### cache the mapping computation of multi-modal backbone
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr_cache.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8
## add LSS
### cache the mapping computation of multi-modal backbone
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+LSS_cache.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8
## add LSS
### cache the mapping computation of multi-modal backbone and LSS
cd tools
bash scripts/dist_test.sh 8 --cfg_file ./cfgs/nuscenes_models/unitr+LSS_cache_plus.yaml --ckpt <CHECKPOINT_FILE> --batch_size 8
Performance of cache testing on NuScenes validation (some variations in camera parameters)
Model | NDS | mAP | mATE | mASE | mAOE | mAVE | mAAE |
---|---|---|---|---|---|---|---|
UniTR (Cache Backbone) | 72.6(-0.4) | 69.4(-0.7) | 26.9 | 24.8 | 26.3 | 24.6 | 18.2 |
UniTR+LSS (Cache Backbone) | 73.1(-0.2) | 70.2(-0.3) | 25.8 | 24.4 | 26.0 | 25.3 | 18.2 |
UniTR+LSS (Cache Backbone and LSS) | 72.6(-0.7) | 69.3(-1.2) | 26.7 | 24.3 | 25.9 | 25.3 | 18.2 |
Potential Research
- Infrastructure of 3D Vision Foundation Model. An efficient network design is crucial for large models. With a reliable model structure, the development of large models can be advanced. An open question is how to make a general multimodal backbone more efficient and easier to deploy. Honestly, the partition step in UniTR is slow and takes about 40% of the total time, but this cost can be eliminated with better partition strategies or some engineering effort, so there is still huge room for speed optimization. We are not HPC experts, but if anyone in the industry wants to improve this, we believe it could be halved. Importantly, this part does not scale with model size, making it friendly for larger models.
- Multi-Modal Self-supervised Learning based on Image-LiDAR pairs and UniTR. Please refer to the following figure. The images and point clouds both describe the same 3D scene; they complement each other through highly informative correspondences. This allows for the unsupervised learning of more generic scene representations with shared parameters.
- Single-Modal Pretraining. Our model is almost the same as ViT (except for some position embedding strategies). If we adjust the position embedding appropriately, DSVT and UniTR can directly load the pretrained parameters of ViT. This is beneficial for better integration with the 2D community.
- Unified Modeling of 3D Vision. Please refer to the following figure.
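The speed claim in the first item can be put into numbers with Amdahl's law: if a stage takes a fraction p of the runtime and is made s times faster, the overall speedup is 1 / ((1 - p) + p / s). The sketch below uses the 40% share quoted above. Reading "halved" as the partition cost being cut in half is our assumption, and the function name `speedup` is illustrative.

```python
def speedup(fraction, factor):
    """Overall speedup when `fraction` of the runtime is accelerated by `factor`.

    Amdahl's law: 1 / ((1 - fraction) + fraction / factor).
    Pass factor=float("inf") to model removing the stage entirely.
    """
    remaining = (1.0 - fraction) + fraction / factor
    return 1.0 / remaining


partition_share = 0.40  # share of total time quoted in the text

print(round(speedup(partition_share, 2.0), 2))           # 1.25: partition twice as fast
print(round(speedup(partition_share, float("inf")), 2))  # 1.67: partition cost removed
```

Because the partition cost does not grow with model size, its share of total time shrinks for larger models, and so does the achievable speedup from optimizing it.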
Possible Issues
- fp16 training is not supported; gradients may become NaN if you try it.
- If you can't find a solution, search the open and closed issues on our GitHub issues page.
- The torch checkpoint option is enabled by default in the training stage, which saves about 50% of CUDA memory.
- Samples in NuScenes have some variation in camera parameters, so during training every sample recalculates the camera-LiDAR mapping, which significantly slows down training (~40%). If the extrinsic parameters in your dataset are consistent, I recommend caching this computation during training.
- If you are still out of luck, open a new issue on our GitHub. Our turnaround is usually a couple of days.
Citation
Please consider citing our work as follows if it is helpful.
@inproceedings{wang2023unitr,
title={UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation},
author={Wang, Haiyang and Tang, Hao and Shi, Shaoshuai and Li, Aoxue and Li, Zhenguo and Schiele, Bernt and Wang, Liwei},
booktitle={ICCV},
year={2023}
}
Acknowledgments
UniTR uses code from a few open-source repositories. Without the efforts of these folks (and their willingness to release their implementations), UniTR would not be possible. We thank these authors for their efforts!