DVT: Denoising Vision Transformers
2024-01-19: We will release all of our stronger denoiser checkpoints within two weeks.
This is the official code release for
Denoising Vision Transformers.
by Jiawei Yang†*, Katie Z Luo*, Jiefeng Li, Kilian Q. Weinberger, Yonglong Tian, and Yue Wang
Paper | Arxiv | Project Page
* equal technical contribution, † project lead
Abstract
We delve into a nuanced but significant challenge inherent to Vision Transformers (ViTs): feature maps of these models exhibit grid-like artifacts, which hurt the performance of ViTs in downstream tasks. Our investigations trace this fundamental issue down to the positional embeddings at the input stage. To address this, we propose a novel noise model, which is universally applicable to all ViTs. Specifically, the noise model dissects ViT outputs into three components: a semantics term free from noise artifacts and two artifact-related terms that are conditioned on pixel locations. Such a decomposition is achieved by enforcing cross-view feature consistency with neural fields. This per-image optimization process extracts artifact-free features from raw ViT outputs, providing clean ViT features for offline applications. Furthermore, we introduce a learnable denoiser to predict artifact-free features directly from unprocessed ViT outputs, capable of generalizing to unseen data without the need for per-image optimization. Our two-stage approach, which we term Denoising Vision Transformers (DVT), does not require re-training existing pre-trained ViTs and is immediately applicable to any Transformer-based architecture. We evaluate our method on a variety of representative ViTs (DINO, MAE, DeiT-III, EVA02, CLIP, DINOv2, DINOv2-reg). Extensive evaluations demonstrate that our DVT consistently and significantly improves existing state-of-the-art general-purpose models in semantic and geometric tasks across multiple datasets (e.g., +3.84 mIoU). We hope our study will encourage a re-evaluation of ViT design, especially regarding the naive use of positional embeddings.
TL;DR:
We identify crucial artifacts in ViTs caused by positional embeddings and propose a two-stage approach to remove these artifacts, which significantly improves the feature quality of different pre-trained ViTs.
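As a rough sketch of this decomposition (the notation below is ours for illustration and may not match the paper's exact formulation):

```latex
% Illustrative sketch of the three-term decomposition (notation is ours):
%   y(x, p) : raw ViT output feature of image x at patch position p
%   f(x)    : artifact-free semantics term
%   g(p)    : input-independent artifact shared across images, a function of p only
%   h(x, p) : residual artifact depending on both image content and position
\[
  y(x, p) \;=\; f(x) \;+\; g(p) \;+\; h(x, p)
\]
% g and h are recovered per image by enforcing cross-view feature consistency
% with neural fields; f(x) is kept as the clean, denoised feature.
```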
Citation
@article{yang2024denoising,
author = {Jiawei Yang and Katie Z Luo and Jiefeng Li and Kilian Q Weinberger and Yonglong Tian and Yue Wang},
title = {Denoising Vision Transformers},
journal = {arXiv preprint arXiv:2401.02957},
year = {2024},
}
Installation
- Create a conda environment.
conda create -n dvt python=3.9 -y
- Activate the environment.
conda activate dvt
- Install dependencies from requirements.txt.
pip install -r requirements.txt
- Install tiny-cuda-nn manually:
pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
If you encounter the error nvcc fatal : Unsupported gpu architecture compute_89, try the following command:
TCNN_CUDA_ARCHITECTURES=86 pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
If you encounter the error parameter packs not expanded with '...', refer to this solution on GitHub.
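Once installed, a quick way to sanity-check the PyTorch binding is shown below. This is a minimal sketch; the hash-grid configuration is a standard instant-ngp-style example, not a DVT-specific requirement.

```python
# Minimal sanity check that the tiny-cuda-nn PyTorch binding works.
# The encoding config is a standard HashGrid example, not DVT-specific.
import torch
import tinycudann as tcnn

encoding = tcnn.Encoding(
    n_input_dims=2,  # 2D pixel coordinates
    encoding_config={
        "otype": "HashGrid",
        "n_levels": 16,
        "n_features_per_level": 2,
        "log2_hashmap_size": 19,
        "base_resolution": 16,
        "per_level_scale": 2.0,
    },
)
coords = torch.rand(1024, 2, device="cuda")  # normalized (x, y) positions
features = encoding(coords)                  # -> (1024, 32) feature vectors
print(features.shape, features.dtype)
```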
Data preparation
- PASCAL VOC 2007 and 2012: Please download the PASCAL VOC07 and PASCAL VOC12 datasets (link) and put the data in the folder data, e.g.,
mkdir -p data
cd data
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2007/VOCtrainval_06-Nov-2007.tar
tar -xf VOCtrainval_06-Nov-2007.tar
wget http://host.robots.ox.ac.uk/pascal/VOC/voc2012/VOCtrainval_11-May-2012.tar
tar -xf VOCtrainval_11-May-2012.tar
In our experiments reported in the paper, we used the first 10,000 examples from data/voc_train.txt for stage-1 denoising. This text file was generated by gathering all JPG images from data/VOC2007/JPEGImages and data/VOC2012/JPEGImages, excluding the validation images, and then randomly shuffling them; a sketch of this procedure appears after this list.
- ADE20K: [legacy, need to check] Please download the ADE20K dataset and put the data in data/ADEChallengeData2016.
- NYU-D: Please download the NYU-Depth dataset and put the data in data/nyu. Results are reported using the 2014 annotations, following previous works.
- ImageNet (optional):
  - Download the ImageNet dataset from http://www.image-net.org/
  - Extract the data following the instructions at here.
  - Put the data in data/imagenet.
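For reference, here is a minimal sketch of how a file list like data/voc_train.txt can be regenerated. The validation-split source (ImageSets/Segmentation/val.txt) and the shuffle seed are our assumptions; the exact script used for the paper may differ.

```python
# Hedged sketch: rebuild a shuffled training file list like data/voc_train.txt.
# Assumptions: validation image IDs come from ImageSets/Segmentation/val.txt,
# and the output file stores one image path per line. Details may differ from
# the script actually used for the paper.
import random
from pathlib import Path

data_root = Path("data")

def read_val_ids(voc_dir: Path) -> set[str]:
    val_file = voc_dir / "ImageSets" / "Segmentation" / "val.txt"
    return set(val_file.read_text().split())

paths = []
for year in ("VOC2007", "VOC2012"):
    voc_dir = data_root / year
    val_ids = read_val_ids(voc_dir)
    for jpg in sorted((voc_dir / "JPEGImages").glob("*.jpg")):
        if jpg.stem not in val_ids:  # exclude validation images
            paths.append(str(jpg))

random.seed(0)  # assumed seed; the original shuffle seed is not documented
random.shuffle(paths)
(data_root / "voc_train.txt").write_text("\n".join(paths) + "\n")
```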
Run the code
See sample_scripts for examples of running the code.
We provide some demo outputs in demo/demo_outputs. For example, this image shows our denoising results on a cat image. From left to right, we show: (1) the input crop, (2) the raw DINOv2-base output, (3) K-means clustering of the raw output, (4) the L2 feature norm of the raw output, (5) the similarity between the central patch and all other patches in the raw output, (6) our denoised output, (7) K-means clustering of the denoised output, (8) the L2 feature norm of the denoised output, (9) the similarity between the central patch and all other patches in the denoised output, (10) the decomposed shared artifacts, (11) the L2 norm of the shared artifacts, (12) the ground-truth residual error, (13) the predicted residual term, and (14) the composition of the shared artifacts and the predicted residual term.
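To produce similar visualizations on your own images, the following is a rough sketch using the public DINOv2 torch.hub model. The preprocessing constants, the number of K-means clusters, and the image path are our own choices for illustration, not the repository's exact demo script.

```python
# Hedged sketch: visualize raw ViT patch features via K-means clustering and
# central-patch cosine similarity, similar in spirit to the demo panels above.
# Uses the public DINOv2 torch.hub model; normalization constants are the
# standard ImageNet ones. This is not the repository's exact demo script.
import torch
import torch.nn.functional as F
from PIL import Image
from sklearn.cluster import KMeans
from torchvision import transforms

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14").eval()

preprocess = transforms.Compose([
    transforms.Resize((518, 518)),  # 37x37 patches at patch size 14
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
img = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    feats = model.forward_features(img)["x_norm_patchtokens"][0]  # (1369, 768)

h = w = 37
labels = KMeans(n_clusters=8, n_init=10).fit_predict(feats.numpy()).reshape(h, w)
norms = feats.norm(dim=-1).reshape(h, w)              # L2 feature norm map
center = feats[(h // 2) * w + w // 2]                 # central patch feature
sim = F.cosine_similarity(feats, center[None], dim=-1).reshape(h, w)
```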
Main Results and Checkpoints
VOC Evaluation Results
| Model | mIoU | aAcc | mAcc | Logfile |
|---|---|---|---|---|
| MAE | 50.24 | 88.02 | 63.15 | log |
| MAE + DVT | 50.53 | 88.06 | 63.29 | log |
| DINO | 63.00 | 91.38 | 76.35 | log |
| DINO + DVT | 66.22 | 92.41 | 78.14 | log |
| Registers | 83.64 | 96.31 | 90.67 | log |
| Registers + DVT | 84.50 | 96.56 | 91.45 | log |
| DeiT3 | 70.62 | 92.69 | 81.23 | log |
| DeiT3 + DVT | 73.36 | 93.34 | 83.74 | log |
| EVA | 71.52 | 92.76 | 82.95 | log |
| EVA + DVT | 73.15 | 93.43 | 83.55 | log |
| CLIP | 77.78 | 94.74 | 86.57 | log |
| CLIP + DVT | 79.01 | 95.13 | 87.48 | log |
| DINOv2 | 83.60 | 96.30 | 90.82 | log |
| DINOv2 + DVT | 84.84 | 96.67 | 91.70 | log |
ADE20K Evaluation Results
| Model | mIoU | aAcc | mAcc | Logfile |
|---|---|---|---|---|
| MAE | 23.60 | 68.54 | 31.49 | log |
| MAE + DVT | 23.62 | 68.58 | 31.25 | log |
| DINO | 31.03 | 73.56 | 40.33 | log |
| DINO + DVT | 32.40 | 74.53 | 42.01 | log |
| Registers | 48.22 | 81.11 | 60.52 | log |
| Registers + DVT | 49.34 | 81.94 | 61.70 | log |
| DeiT3 | 32.73 | 72.61 | 42.81 | log |
| DeiT3 + DVT | 36.57 | 74.44 | 49.01 | log |
| EVA | 37.45 | 72.78 | 49.74 | log |
| EVA + DVT | 37.87 | 75.02 | 49.81 | log |
| CLIP | 40.51 | 76.44 | 52.47 | log |
| CLIP + DVT | 41.10 | 77.41 | 53.07 | log |
| DINOv2 | 47.29 | 80.84 | 59.18 | log |
| DINOv2 + DVT | 48.66 | 81.89 | 60.24 | log |
NYU-D Evaluation Results
| Model | RMSE ↓ | Rel ↓ | Logfile |
|---|---|---|---|
| MAE | 0.6695 | 0.2334 | log |
| MAE + DVT | 0.7080 | 0.2560 | log |
| DINO | 0.5832 | 0.1701 | log |
| DINO + DVT | 0.5780 | 0.1731 | log |
| Registers | 0.3969 | 0.1190 | log |
| Registers + DVT | 0.3880 | 0.1157 | log |
| DeiT3 | 0.588 | 0.1788 | log |
| DeiT3 + DVT | 0.5891 | 0.1802 | log |
| EVA | 0.6446 | 0.1989 | log |
| EVA + DVT | 0.6243 | 0.1964 | log |
| CLIP | 0.5598 | 0.1679 | log |
| CLIP + DVT | 0.5591 | 0.1667 | log |
| DINOv2 | 0.4034 | 0.1238 | log |
| DINOv2 + DVT | 0.3943 | 0.1200 | log |
Denoiser Checkpoints
- [ ] To be released.