Neighborhood Attention Transformers
Powerful hierarchical vision transformers based on sliding window attention.
Neighborhood Attention (NA, local attention) was introduced in our original paper, NAT, and runs efficiently via our PyTorch extension, NATTEN.
We recently introduced a new model, DiNAT, which extends NA by dilating neighborhoods (DiNA, sparse global attention, a.k.a. dilated local attention).
Combining NA and DiNA preserves locality, maintains translational equivariance, expands the receptive field exponentially, and captures longer-range inter-dependencies, yielding significant performance boosts in downstream vision tasks, such as StyleNAT for image generation.
News
March 25, 2023
- Neighborhood Attention Transformer was accepted to CVPR 2023!
November 18, 2022
- NAT and DiNAT are now available through Hugging Face's transformers library.
November 11, 2022
- New preprint: StyleNAT: Giving Each Head a New Perspective.
October 8, 2022
- NATTEN is now available as a pip package!
- You can now install NATTEN with pre-compiled wheels, and start using it in seconds.
- NATTEN will be maintained and developed as a separate project to support broader usage of sliding window attention, even beyond computer vision.
September 29, 2022
- New preprint: Dilated Neighborhood Attention Transformer.
Dilated Neighborhood Attention 🔥
A new hierarchical vision transformer based on Neighborhood Attention (local attention) and Dilated Neighborhood Attention (sparse global attention) that enjoys significant performance boosts in downstream tasks.
Check out the DiNAT README.
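To make the dilation idea concrete, here is a rough 1D sketch (an illustration only, not the DiNAT implementation, which operates on 2D feature maps through NATTEN) of which positions a dilated neighborhood covers. The function name and its clamping details are our own for this example; it assumes the sequence is long enough to hold a full dilated window (seq_len ≥ kernel_size × dilation).

```python
import numpy as np

def dilated_neighborhood_1d(i, seq_len, kernel_size, dilation):
    """Indices attended to by query position i under a dilated neighborhood (1D sketch).

    With dilation == 1 this reduces to plain Neighborhood Attention: the
    kernel_size nearest positions. With dilation > 1 the same number of
    positions is attended, but spread out, so the receptive field grows
    at no extra cost.
    """
    # Positions sharing i's phase modulo `dilation` form a sublattice;
    # the dilated neighborhood is i's nearest neighbors on that sublattice.
    phase = i % dilation
    lattice = np.arange(phase, seq_len, dilation)  # positions ≡ i (mod dilation)
    j = (i - phase) // dilation                    # i's index on the sublattice
    r = kernel_size // 2
    # Clamp the window so it stays inside the sublattice (edge handling).
    start = min(max(j - r, 0), len(lattice) - kernel_size)
    return lattice[start:start + kernel_size]
```

For example, with a sequence of length 8 and kernel size 3, position 1 attends to positions [0, 1, 2] at dilation 1, but to [1, 3, 5] at dilation 2.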
Neighborhood Attention Transformer
Our original paper introduced Neighborhood Attention Transformer (NAT), the first efficient sliding-window local attention model.
How Neighborhood Attention works
Neighborhood Attention localizes the query token's (red) receptive field to its nearest neighboring tokens in the key-value pair (green). It is equivalent to standard dot-product self-attention when the neighborhood size equals the image dimensions. Note that queries near the edges are special cases: their windows are clamped inward so every query attends to the same number of tokens.
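The mechanism above can be sketched in a few lines of NumPy for a 1D sequence. This is an illustration only, with a function name of our own choosing; the real NATTEN kernels handle 2D feature maps, multiple heads, and are far more efficient than this loop.

```python
import numpy as np

def neighborhood_attention_1d(q, k, v, kernel_size):
    """1D neighborhood attention sketch: each query attends to its
    `kernel_size` nearest keys.

    q, k, v: arrays of shape (seq_len, dim), with kernel_size <= seq_len.
    Near the edges the window is shifted inward, so every query still
    sees exactly `kernel_size` keys (the edge cases noted above).
    """
    n, d = q.shape
    r = kernel_size // 2
    out = np.empty_like(v)
    for i in range(n):
        # Clamp the window so it stays inside [0, n).
        start = min(max(i - r, 0), n - kernel_size)
        idx = np.arange(start, start + kernel_size)
        scores = q[i] @ k[idx].T / np.sqrt(d)   # (kernel_size,) attention logits
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                # softmax over the neighborhood
        out[i] = weights @ v[idx]
    return out
```

Setting kernel_size equal to the sequence length recovers full dot-product self-attention, matching the equivalence described above.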
Citation
@inproceedings{hassani2023neighborhood,
title = {Neighborhood Attention Transformer},
author = {Ali Hassani and Steven Walton and Jiachen Li and Shen Li and Humphrey Shi},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2023},
pages = {6185--6194}
}
@article{hassani2022dilated,
title = {Dilated Neighborhood Attention Transformer},
author = {Ali Hassani and Humphrey Shi},
year = {2022},
url = {https://arxiv.org/abs/2209.15001},
eprint = {2209.15001},
archiveprefix = {arXiv},
primaryclass = {cs.CV}
}
@article{walton2022stylenat,
title = {StyleNAT: Giving Each Head a New Perspective},
author = {Steven Walton and Ali Hassani and Xingqian Xu and Zhangyang Wang and Humphrey Shi},
year = {2022},
url = {https://arxiv.org/abs/2211.05770},
eprint = {2211.05770},
archiveprefix = {arXiv},
primaryclass = {cs.CV}
}