Class-agnostic Object Detection with Multi-modal Transformer (ECCV 2022)
Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer and Ming-Hsuan Yang
News
- (July 06, 2022)
  - Paper accepted at ECCV 2022
- (Feb 01, 2022)
  - Training codes for MAVL and MAVL minus Language models are released -> training/README.md
  - Instructions to use class-agnostic object detection behavior of MAVL on different applications are released -> applications/README.md
  - All the pretrained models (MAVL, Def-DETR, MDETR, DETReg, Faster-RCNN, RetinaNet, ORE, and others), along with the instructions to reproduce the results are released -> this link
- (Nov 25, 2021) Evaluation code along with pre-trained models & pre-computed predictions is released. evaluation/README.md
Abstract: What constitutes an object? This has been a long-standing question in computer vision. Towards this goal, numerous learning-free and learning-based approaches have been developed to score objectness. However, they generally do not scale well across new domains and novel objects. In this paper, we advocate that existing methods lack a top-down supervision signal governed by human-understandable semantics. For the first time in literature, we demonstrate that Multi-modal Vision Transformers (MViT) trained with aligned image-text pairs can effectively bridge this gap. Our extensive experiments across various domains and novel objects show the state-of-the-art performance of MViTs to localize generic objects in images. Based on the observation that existing MViTs do not include multi-scale feature processing and usually require longer training schedules, we develop an efficient MViT architecture using multi-scale deformable attention and late vision-language fusion. We show the significance of MViT proposals in a diverse range of applications including open-world object detection, salient and camouflage object detection, supervised and self-supervised detection tasks. Further, MViTs can adaptively generate proposals given a specific language query and thus offer enhanced interactability.
Architecture overview of MViTs used in this work: GPV-1, MDETR and Multiscale Attention ViT with Late fusion (MAVL) (ours).
Installation
The code is tested with PyTorch 1.8.0 and CUDA 11.1. After cloning the repository, follow the steps below to install:
- Install PyTorch and torchvision
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 -f https://download.pytorch.org/whl/torch_stable.html
- Install other dependencies
pip install -r requirements.txt
- Compile Deformable Attention modules
cd models/ops
sh make.sh
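Since the code is tested against one specific PyTorch/CUDA pairing, it can help to confirm the environment before compiling the deformable attention ops. The following is a minimal sketch, not part of the repository: the helper names `parse_version` and `matches_tested_setup` are hypothetical, and the check only compares against the versions stated above.

```python
import re
from typing import Optional, Tuple


def parse_version(version: str) -> Tuple[int, ...]:
    """Return the leading numeric components, e.g. '1.8.0+cu111' -> (1, 8, 0)."""
    core = version.split("+")[0]  # drop a local build tag such as '+cu111'
    return tuple(int(part) for part in re.findall(r"\d+", core)[:3])


def matches_tested_setup(
    torch_version: str,
    cuda_version: Optional[str],
    expected_torch: Tuple[int, ...] = (1, 8, 0),  # version stated in this README
    expected_cuda: str = "11.1",                  # version stated in this README
) -> bool:
    """True when the installed versions equal the tested PyTorch/CUDA pair."""
    return parse_version(torch_version) == expected_torch and cuda_version == expected_cuda


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("PyTorch is not installed")
    else:
        # torch.version.cuda is None for CPU-only builds.
        print("torch:", torch.__version__, "| CUDA build:", torch.version.cuda)
        print("GPU available:", torch.cuda.is_available())
        print("matches tested setup:",
              matches_tested_setup(torch.__version__, torch.version.cuda))
```

Other version combinations may also work; a mismatch here only means the setup differs from the tested one.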
Results
Results of class-agnostic object detection with MViTs, including our proposed Multiscale Attention ViT with Late fusion (MAVL) model, along with applications and exploratory analysis.
Class-agnostic Object Detection performance of MViTs in comparison with bottom-up approaches and uni-modal detectors on five natural image OD datasets. MViTs show consistently good results on all datasets.
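To make the class-agnostic setting concrete, the sketch below computes recall for one image while ignoring category labels: a ground-truth box counts as found if any of the top-k proposals overlaps it sufficiently. This is an illustrative simplification, not the repository's evaluation code (see evaluation/README.md for that); the function names, `top_k=50`, the 0.5 IoU threshold, and the any-match rule are assumptions.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def class_agnostic_recall(
    gt_boxes: List[Box],
    pred_boxes: List[Box],
    pred_scores: List[float],
    top_k: int = 50,          # assumed proposal budget
    iou_thresh: float = 0.5,  # assumed matching threshold
) -> float:
    """Fraction of ground-truth boxes covered by any of the top-k proposals.

    Category labels are never consulted, which is what makes the metric
    class-agnostic. One proposal may cover several ground-truth boxes here.
    """
    if not gt_boxes:
        return 0.0
    order = sorted(range(len(pred_boxes)), key=lambda i: pred_scores[i], reverse=True)
    kept = [pred_boxes[i] for i in order[:top_k]]
    hits = sum(1 for g in gt_boxes if any(iou(g, p) >= iou_thresh for p in kept))
    return hits / len(gt_boxes)


gt = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [(1, 1, 10, 10), (50, 50, 60, 60), (20, 20, 25, 25)]
scores = [0.9, 0.8, 0.3]
print(class_agnostic_recall(gt, preds, scores))
```

In this toy input the first ground-truth box is matched (IoU 0.81) and the second is not (IoU 0.25).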
Generalization to New Domains: Class-agnostic OD performance of MViTs in comparison with a uni-modal detector (RetinaNet) on five out-of-domain OD datasets. MViTs show consistently good results on all datasets.
Generalization to Rare/Novel Classes: MAVL class-agnostic OD performance on rarely and frequently occurring categories in the pretraining captions. The numbers on top of the bars indicate occurrences of the corresponding category in the training dataset. The MViT achieves good recall values even for the classes with no or very few occurrences.
Enhanced Interactability: Effect of using different intuitive text queries on the MAVL class-agnostic OD performance. Combining detections from multiple queries captures varying aspects of objectness.
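One plausible way to combine detections from multiple queries is to pool the boxes from every query and remove near-duplicates with score-ordered non-maximum suppression. The sketch below assumes exactly that; the text above does not specify the merge rule, and the query strings, function names, and 0.5 IoU threshold are illustrative only.

```python
from typing import Dict, List, Sequence, Tuple

Box = Sequence[float]           # (x1, y1, x2, y2)
Detection = Tuple[Box, float]   # (box, confidence score)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def combine_queries(per_query: Dict[str, List[Detection]],
                    iou_thresh: float = 0.5) -> List[Detection]:
    """Pool detections from all text queries, then apply greedy NMS."""
    pooled = [det for dets in per_query.values() for det in dets]
    pooled.sort(key=lambda det: det[1], reverse=True)  # highest score first
    kept: List[Detection] = []
    for box, score in pooled:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, kept_box) < iou_thresh for kept_box, _ in kept):
            kept.append((box, score))
    return kept


# Illustrative query strings; each maps to that query's detections.
per_query = {
    "all objects": [((0, 0, 10, 10), 0.9)],
    "all entities": [((1, 1, 10, 10), 0.8), ((20, 20, 30, 30), 0.7)],
}
combined = combine_queries(per_query)
print(combined)
```

Here the second query's first box duplicates one already found by the first query and is suppressed, while its second box contributes a new object.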
Language Skeleton/Structure: Experimental analysis to explore the contribution of language by removing all textual inputs while maintaining the structure introduced by captions. All experiments are performed on Def-DETR. In setting 1, annotations corresponding to the same image are combined. Setting 2 additionally applies NMS to remove duplicate boxes. In setting 3, four to eight boxes are randomly grouped in each iteration. The same model is trained longer in setting 4. In setting 5, the dataloader structure corresponding to captions is kept intact. Results from setting 5 demonstrate the importance of the structure introduced by language.
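The data handling behind settings 1 and 3 can be sketched roughly as follows. This is one reading of the description above, not the training code: the `(image_id, box)` annotation format, the function names, and the choice to let the final group fall below four boxes when too few remain are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Box = Sequence[float]  # (x1, y1, x2, y2)


def combine_by_image(annotations: List[Tuple[Hashable, Box]]) -> Dict[Hashable, List[Box]]:
    """Setting 1 (as read here): gather every box that belongs to the same image."""
    merged: Dict[Hashable, List[Box]] = defaultdict(list)
    for image_id, box in annotations:
        merged[image_id].append(box)
    return dict(merged)


def random_groups(boxes: List[Box], rng: random.Random,
                  min_size: int = 4, max_size: int = 8) -> List[List[Box]]:
    """Setting 3 (as read here): split an image's boxes into random groups.

    Each group size is drawn uniformly from [min_size, max_size]; the final
    group may be smaller when fewer boxes remain.
    """
    shuffled = list(boxes)
    rng.shuffle(shuffled)
    groups, start = [], 0
    while start < len(shuffled):
        size = rng.randint(min_size, max_size)
        groups.append(shuffled[start:start + size])
        start += size
    return groups


annotations = [("img_a", (0, 0, 5, 5)), ("img_b", (1, 1, 4, 4)), ("img_a", (2, 2, 9, 9))]
per_image = combine_by_image(annotations)

boxes = [(i, i, i + 10, i + 10) for i in range(20)]
groups = random_groups(boxes, random.Random(0))
print([len(g) for g in groups])
```

Calling `random_groups` again with a fresh draw in each iteration yields a different grouping, which is how "in each iteration" is interpreted here.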
Open-world Object Detection: Effect of using class-agnostic OD proposals from MAVL for pseudo labelling of unknowns in Open World Detector (ORE).
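A rough sketch of proposal-based pseudo-labelling: take the highest-scoring class-agnostic proposals that do not overlap any annotated known object and mark them as unknown. The selection rule, the function name, `top_k=5`, and the 0.5 IoU threshold are assumptions made for illustration; applications/README.md describes the actual procedure.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def pseudo_label_unknowns(
    proposals: List[Box],
    scores: List[float],
    known_boxes: List[Box],
    top_k: int = 5,           # assumed number of unknowns per image
    iou_thresh: float = 0.5,  # assumed overlap limit with known objects
) -> List[Box]:
    """Return up to top_k highest-scoring proposals that miss every known box."""
    order = sorted(range(len(proposals)), key=lambda i: scores[i], reverse=True)
    unknowns: List[Box] = []
    for i in order:
        if len(unknowns) == top_k:
            break
        if all(iou(proposals[i], known) < iou_thresh for known in known_boxes):
            unknowns.append(proposals[i])
    return unknowns


proposals = [(0, 0, 10, 10), (20, 20, 30, 30), (40, 40, 50, 50)]
scores = [0.9, 0.8, 0.7]
known = [(1, 1, 10, 10)]  # annotated object of a known class
print(pseudo_label_unknowns(proposals, scores, known, top_k=1))
```

The top-scoring proposal is skipped because it overlaps the known object, so the next one becomes the pseudo-labelled unknown.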
Pretraining for Class-aware Object Detection: Effect of using MAVL proposals for pre-training of DETReg instead of Selective Search proposals.
Evaluation
Please refer to evaluation/class_agnostic_od/README.md.
Training
Please refer to training/README.md.
Applications
Please refer to applications/README.md.
Citation
If you use our work, please consider citing:
@inproceedings{Maaz2022Multimodal,
title={Class-agnostic Object Detection with Multi-modal Transformer},
author={Maaz, Muhammad and Rasheed, Hanoona and Khan, Salman and Khan, Fahad Shahbaz and Anwer, Rao Muhammad and Yang, Ming-Hsuan},
booktitle={17th European Conference on Computer Vision (ECCV)},
year={2022},
organization={Springer}
}
Contact
Should you have any questions, please create an issue on this repository or contact us at [email protected], [email protected]