DiffiT: Diffusion Vision Transformers for Image Generation
Official PyTorch implementation of DiffiT: Diffusion Vision Transformers for Image Generation.
Code and pretrained DiffiT models will be released soon!
DiffiT achieves a new SOTA FID score of 1.73 on the ImageNet-256 dataset!
In addition, DiffiT sets a new SOTA FID score of 2.22 on the FFHQ-64 dataset!
We introduce a new Time-dependent Multihead Self-Attention (TMSA) mechanism that jointly learns spatial and temporal dependencies and allows for attention conditioning with fine-grained control.
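As a rough illustration of the idea, the sketch below forms the attention queries, keys, and values from a spatial token plus a time-step token, so the attention pattern can change across the denoising process. This is a minimal, illustrative PyTorch sketch only; the module and variable names are assumptions and do not reflect the official implementation.

```python
import math
import torch
import torch.nn as nn

class TMSA(nn.Module):
    """Minimal sketch of Time-dependent Multihead Self-Attention (TMSA).

    Queries, keys, and values each combine a spatial projection with a
    time-token projection, so attention weights depend on the diffusion
    time step. Illustrative only, not the released DiffiT code.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        # separate spatial and temporal projections for q, k, v
        self.qkv_spatial = nn.Linear(dim, 3 * dim, bias=False)
        self.qkv_time = nn.Linear(dim, 3 * dim, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x_s, x_t):
        # x_s: (B, N, C) spatial tokens, x_t: (B, C) time-embedding token
        B, N, C = x_s.shape
        qkv = self.qkv_spatial(x_s) + self.qkv_time(x_t).unsqueeze(1)
        q, k, v = qkv.chunk(3, dim=-1)
        # reshape each to (B, num_heads, N, head_dim)
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))
        attn = (q @ k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        # (the paper additionally adds a relative positional bias here)
        out = attn.softmax(dim=-1) @ v
        out = out.transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Example usage with illustrative shapes
tmsa = TMSA(dim=256, num_heads=8)
x_s = torch.randn(2, 64, 256)   # 64 spatial tokens per image
x_t = torch.randn(2, 256)       # time-step embedding token
out = tmsa(x_s, x_t)            # (2, 64, 256)
```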
🔥 News 🔥
- [12.04.2023] 🔥 DiffiT manuscript is now available on arXiv!
Benchmarks
Latent Space
ImageNet-256
| Model | Dataset | Resolution | FID-50K | Inception Score |
|---|---|---|---|---|
| Latent DiffiT | ImageNet | 256x256 | 1.73 | 276.49 |
ImageNet-512
| Model | Dataset | Resolution | FID-50K | Inception Score |
|---|---|---|---|---|
| Latent DiffiT | ImageNet | 512x512 | 2.67 | 252.12 |
Image Space
| Model | Dataset | Resolution | FID-50K |
|---|---|---|---|
| DiffiT | CIFAR-10 | 32x32 | 1.95 |
| DiffiT | FFHQ-64 | 64x64 | 2.22 |