DVT: Denoising Vision Transformers
2024-01-19: We will release all of our stronger denoiser checkpoints within two weeks.
This is the official code release for
Denoising Vision Transformers.
by Jiawei Yang†*, Katie Z Luo*, Jiefeng Li, Kilian Q. Weinberger, Yonglong Tian, and Yue Wang
Paper | Arxiv | Project Page
* equal technical contribution, † project lead
Abstract
We delve into a nuanced but significant challenge inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream tasks. Our investigations trace this fundamental issue down to the positional embeddings at the input stage. To address this, we propose a novel noise model, which is universally applicable to all ViTs. Specifically, the noise model dissects ViT outputs into three components: a semantics term free from noise artifacts and two artifact-related terms that are conditioned on pixel locations. Such a decomposition is achieved by enforcing cross-view feature consistency with neural fields. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean ViT features for offline applications. Furthermore, we introduce a learnable denoiser to predict artifact-free features directly from unprocessed ViT outputs, capable of generalizing to unseen data without the need for per-image optimization. Our two-stage approach, which we term Denoising Vision Transformers (DVT), does not require re-training existing pre-trained ViTs and is immediately applicable to any Transformer-based architecture. We evaluate our method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg). Extensive evaluations demonstrate that our DVT consistently and significantly improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets (e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings.
TL;DR:
We identify crucial artifacts in ViTs caused by positional embeddings and propose a two-stage approach to remove these artifacts, which significantly improves the feature quality of different pre-trained ViTs.
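As a rough sketch of this decomposition (the notation below is ours for illustration and may not match the paper's exact formulation):

```latex
% Illustrative sketch of the three-term decomposition (notation is ours):
%   y(x, p) : raw ViT output feature of image x at patch position p
%   f(x)    : artifact-free semantics term
%   g(p)    : input-independent artifact shared across images, a function of p only
%   h(x, p) : residual artifact depending on both image content and position
\[
  y(x, p) \;=\; f(x) \;+\; g(p) \;+\; h(x, p)
\]
% g and h are recovered per image by enforcing cross-view feature consistency
% with neural fields; f(x) is kept as the clean, denoised feature.
```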
Citation
@article{yang2024denoising,
author = {Jiawei Yang and Katie Z Luo and Jiefeng Li and Kilian Q Weinberger and Yonglong Tian and Yue Wang},
title = {Denoising Vision Transformers},
journal = {arXiv preprint arXiv:2401.02957},
year = {2024},
}
Installation
- Create a conda environment.
conda create -n dvt python=3.9 -y
- Activate the environment.
conda activate dvt
- Install dependencies from requirements.txt.
pip install -r requirements.txt
- Install tiny-cuda-nn manually:
pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
If you encounter the error nvcc fatal : Unsupported gpu architecture compute_89, try the following command:
TCNN_CUDA_ARCHITECTURES=86 pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
If you encounter the error parameter packs not expanded with '...', refer to this solution on GitHub.
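Once installed, a quick way to sanity-check the PyTorch binding is shown below. This is a minimal sketch; the hash-grid configuration is a standard instant-ngp-style example, not a DVT-specific requirement.

```python
# Minimal sanity check that the tiny-cuda-nn PyTorch binding works.
# The encoding config is a standard HashGrid example, not DVT-specific.
import torch
import tinycudann as tcnn

encoding = tcnn.Encoding(
    n_input_dims=2,  # 2D pixel coordinates
    encoding_config={
        "otype": "HashGrid",
        "n_levels": 16,
        "n_features_per_level": 2,
        "log2_hashmap_size": 19,
        "base_resolution": 16,
        "per_level_scale": 2.0,
    },
)
coords = torch.rand(1024, 2, device="cuda")  # normalized (x, y) positions
features = encoding(coords)                  # -> (1024, 32) feature vectors
print(features.shape, features.dtype)
```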
Data preparation
- PASCAL VOC 2007 and 2012: Please download the PASCAL VOC07 and PASCAL VOC12 datasets (link) and put the data in the folder data, e.g.,
mkdir -p data
cd data
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
tar -xf VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
tar -xf VOCtrainval_11-May-2012.tar
In our experiments reported in the paper, we used the first 10,000 examples from data/voc_train.txt for stage-1 denoising. This text file was generated by gathering all JPG images from data/VOC2007/JPEGImages and data/VOC2012/JPEGImages, excluding the validation images, and then randomly shuffling them; a sketch of this procedure appears after this list.
- ADE20K: [legacy, need to check] Please download the ADE20K dataset and put the data in data/ADEChallengeData2016.
- NYU-D: Please download the NYU-Depth dataset and put the data in data/nyu. Results are reported using the 2014 annotations, following previous works.
- ImageNet (optional):
  - Download the ImageNet dataset from http://www.image-net.org/
  - Extract the data following the instructions at here.
  - Put the data in data/imagenet.
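For reference, here is a minimal sketch of how a file list like data/voc_train.txt can be regenerated. The validation-split source (ImageSets/Segmentation/val.txt) and the shuffle seed are our assumptions; the exact script used for the paper may differ.

```python
# Hedged sketch: rebuild a shuffled training file list like data/voc_train.txt.
# Assumptions: validation image IDs come from ImageSets/Segmentation/val.txt,
# and the output file stores one image path per line. Details may differ from
# the script actually used for the paper.
import random
from pathlib import Path

data_root = Path("data")

def read_val_ids(voc_dir: Path) -> set[str]:
    val_file = voc_dir / "ImageSets" / "Segmentation" / "val.txt"
    return set(val_file.read_text().split())

paths = []
for year in ("VOC2007", "VOC2012"):
    voc_dir = data_root / year
    val_ids = read_val_ids(voc_dir)
    for jpg in sorted((voc_dir / "JPEGImages").glob("*.jpg")):
        if jpg.stem not in val_ids:  # exclude validation images
            paths.append(str(jpg))

random.seed(0)  # assumed seed; the original shuffle seed is not documented
random.shuffle(paths)
(data_root / "voc_train.txt").write_text("\n".join(paths) + "\n")
```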
Run the code
See sample_scripts for examples of running the code.
We provide some demo outputs in demo/demo_outputs. For example, this image shows our denoising results on a cat image. From left to right, we show: (1) the input crop, (2) the raw DINOv2-base output, (3) K-means clustering of the raw output, (4) the L2 feature norm of the raw output, (5) the similarity between the central patch and all other patches in the raw output, (6) our denoised output, (7) K-means clustering of the denoised output, (8) the L2 feature norm of the denoised output, (9) the similarity between the central patch and all other patches in the denoised output, (10) the decomposed shared artifacts, (11) the L2 norm of the shared artifacts, (12) the ground-truth residual error, (13) the predicted residual term, and (14) the composition of the shared artifacts and the predicted residual term.
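To produce similar visualizations on your own images, the following is a rough sketch using the public DINOv2 torch.hub model. The preprocessing constants, the number of K-means clusters, and the image path are our own choices for illustration, not the repository's exact demo script.

```python
# Hedged sketch: visualize raw ViT patch features via K-means clustering and
# central-patch cosine similarity, similar in spirit to the demo panels above.
# Uses the public DINOv2 torch.hub model; normalization constants are the
# standard ImageNet ones. This is not the repository's exact demo script.
import torch
import torch.nn.functional as F
from PIL import Image
from sklearn.cluster import KMeans
from torchvision import transforms

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

preprocess = transforms.Compose([
    transforms.Resize((518, 518)),  # 37x37 patches at patch size 14
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    feats = model.forward_features(img)["x_norm_patchtokens"][0]  # (1369, 768)

h = w = 37
labels = KMeans(n_clusters=8, n_init=10).fit_predict(feats.numpy()).reshape(h, w)
norms = feats.norm(dim=-1).reshape(h, w)              # L2 feature norm map
center = feats[(h // 2) * w + w // 2]                 # central patch feature
sim = F.cosine_similarity(feats, center[None], dim=-1).reshape(h, w)
```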
Main Results and Checkpoints
VOC Evaluation Results
| Model | mIoU | aAcc | mAcc | Logfile |
|---|---|---|---|---|
| MAE | 50.24 | 88.02 | 63.15 | log |
| MAE + DVT | 50.53 | 88.06 | 63.29 | log |
| DINO | 63.00 | 91.38 | 76.35 | log |
| DINO + DVT | 66.22 | 92.41 | 78.14 | log |
| Registers | 83.64 | 96.31 | 90.67 | log |
| Registers + DVT | 84.50 | 96.56 | 91.45 | log |
| DeiT3 | 70.62 | 92.69 | 81.23 | log |
| DeiT3 + DVT | 73.36 | 93.34 | 83.74 | log |
| EVA | 71.52 | 92.76 | 82.95 | log |
| EVA + DVT | 73.15 | 93.43 | 83.55 | log |
| CLIP | 77.78 | 94.74 | 86.57 | log |
| CLIP + DVT | 79.01 | 95.13 | 87.48 | log |
| DINOv2 | 83.60 | 96.30 | 90.82 | log |
| DINOv2 + DVT | 84.84 | 96.67 | 91.70 | log |
ADE20K Evaluation Results
| Model | mIoU | aAcc | mAcc | Logfile |
|---|---|---|---|---|
| MAE | 23.60 | 68.54 | 31.49 | log |
| MAE + DVT | 23.62 | 68.58 | 31.25 | log |
| DINO | 31.03 | 73.56 | 40.33 | log |
| DINO + DVT | 32.40 | 74.53 | 42.01 | log |
| Registers | 48.22 | 81.11 | 60.52 | log |
| Registers + DVT | 49.34 | 81.94 | 61.70 | log |
| DeiT3 | 32.73 | 72.61 | 42.81 | log |
| DeiT3 + DVT | 36.57 | 74.44 | 49.01 | log |
| EVA | 37.45 | 72.78 | 49.74 | log |
| EVA + DVT | 37.87 | 75.02 | 49.81 | log |
| CLIP | 40.51 | 76.44 | 52.47 | log |
| CLIP + DVT | 41.10 | 77.41 | 53.07 | log |
| DINOv2 | 47.29 | 80.84 | 59.18 | log |
| DINOv2 + DVT | 48.66 | 81.89 | 60.24 | log |
NYU-D Evaluation Results
| Model | RMSE ↓ | Rel ↓ | Logfile |
|---|---|---|---|
| MAE | 0.6695 | 0.2334 | log |
| MAE + DVT | 0.7080 | 0.2560 | log |
| DINO | 0.5832 | 0.1701 | log |
| DINO + DVT | 0.5780 | 0.1731 | log |
| Registers | 0.3969 | 0.1190 | log |
| Registers + DVT | 0.3880 | 0.1157 | log |
| DeiT3 | 0.588 | 0.1788 | log |
| DeiT3 + DVT | 0.5891 | 0.1802 | log |
| EVA | 0.6446 | 0.1989 | log |
| EVA + DVT | 0.6243 | 0.1964 | log |
| CLIP | 0.5598 | 0.1679 | log |
| CLIP + DVT | 0.5591 | 0.1667 | log |
| DINOv2 | 0.4034 | 0.1238 | log |
| DINOv2 + DVT | 0.3943 | 0.1200 | log |
Denoiser Checkpoints
- [ ] To be released.