(CVPR2023) Integrally Pre-Trained Transformer Pyramid Networks

Figure 1: The comparison between a conventional pre-training (left) and the proposed integral pre-training framework (right). We use a feature pyramid as the unified neck module and apply masked feature modeling for pre-training the feature pyramid. The green and red blocks indicate that the network weights are pre-trained and un-trained (i.e., randomly initialized for fine-tuning), respectively.
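Masked modeling in general starts by hiding a random subset of patch positions and asking the network to predict what was hidden. The block below is a minimal sketch of that selection step only, not the repository's implementation: the function name `random_patch_mask`, the 196-patch grid, and the 0.75 mask ratio are all assumptions chosen for illustration.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=None):
    """Split patch indices into (visible, masked) lists for one image.

    Hypothetical helper for illustration; the actual iTPN masking code
    may differ in how and where it selects patches.
    """
    if not 0.0 <= mask_ratio <= 1.0:
        raise ValueError("mask_ratio must be in [0, 1]")
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


# Assumed setting: a 14x14 patch grid with 75% of the patches hidden.
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75, seed=0)
print(len(visible), len(masked))
```

Only the visible indices would be fed to the encoder in a typical masked-modeling setup; the masked indices mark where the reconstruction loss is computed.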

Updates

30/May/2023

model        Pre-train  teacher     input/patch  21K ft?  Acc on IN.1K
EVA-02-B     IN.21K     EVA-CLIP-g  196/14       N        87.0%
EVA-02-B     IN.21K     EVA-CLIP-g  448/14       N        88.3%
EVA-02-B     IN.21K     EVA-CLIP-g  448/14       Y        88.6%
Fast-iTPN-B  IN.1K      CLIP-L      224/16       N        87.4%
Fast-iTPN-B  IN.1K      CLIP-L      512/16       N        88.5%
Fast-iTPN-B  IN.1K      CLIP-L      512/16       Y        88.7%

The Fast-iTPN models above are pre-trained only on ImageNet-1K. They will be available soon.
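The input/patch column pairs different resolutions with different patch sizes, so it helps to compare how many patches each setting produces. The sketch below assumes square inputs split into non-overlapping square patches and ignores any extra tokens (such as a class token) and any hierarchical stages; `num_tokens` is a hypothetical helper, not part of the repository.

```python
def num_tokens(input_size, patch_size):
    """Number of non-overlapping patches for a square input (assumed layout)."""
    if input_size % patch_size != 0:
        raise ValueError("input size must be divisible by patch size")
    side = input_size // patch_size
    return side * side


# (model, input size, patch size) taken from the input/patch column above.
settings = [
    ("EVA-02-B", 196, 14),
    ("EVA-02-B", 448, 14),
    ("Fast-iTPN-B", 224, 16),
    ("Fast-iTPN-B", 512, 16),
]

for model, input_size, patch_size in settings:
    tokens = num_tokens(input_size, patch_size)
    print(f"{model:12s} {input_size}/{patch_size}: {tokens} patches")
```

Under these assumptions, 196/14 and 224/16 both give 196 patches, and 448/14 and 512/16 both give 1024, so the paired rows are compared at matching patch counts.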

29/May/2023

The iTPN-L-CLIP/16 intermediate fine-tuned models are available (password: itpn): one pretrained on 21K and one fine-tuned on 1K. Evaluating the latter on ImageNet-1K gives 89.2% accuracy.

28/Feb./2023

iTPN is accepted by CVPR2023!

08/Feb./2023

The iTPN-L-CLIP/16 model reaches 89.2% fine-tuning performance on ImageNet-1K.

configurations: intermediate fine-tuning on ImageNet-21K + 384 input size

21/Jan./2023

Our HiViT is accepted by ICLR2023!

HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer

08/Dec./2022

Get checkpoints (password: abcd):

              iTPN-B-pixel  iTPN-B-CLIP  iTPN-L-pixel  iTPN-L-CLIP/16
baidu drive   download      download     download      download
google drive  download      download     download      download

25/Nov./2022

The preprint version is public on arXiv.

Requirements

  • Ubuntu
  • Python 3.7+
  • CUDA 10.2+
  • GCC 5+
  • PyTorch 1.7+
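The version requirements above can be checked from Python before running anything heavier. This is a minimal sketch, not a script shipped with the repository: `parse_version`, `meets_minimum`, and `check_environment` are hypothetical helpers, and only the Python, PyTorch, and CUDA entries are covered (the OS and GCC are not inspected).

```python
import re
import sys


def parse_version(text):
    """Turn a string such as '1.7.1+cu102' into (1, 7, 1)."""
    core = text.split("+")[0]  # drop local build tags like '+cu102'
    numbers = []
    for piece in core.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        numbers.append(int(match.group()))
    return tuple(numbers)


def meets_minimum(found, minimum):
    """Compare versions numerically, so that 1.13 counts as newer than 1.7."""
    return parse_version(found) >= parse_version(minimum)


def check_environment():
    """Report which of the listed requirements the current interpreter meets."""
    report = {"python>=3.7": sys.version_info[:2] >= (3, 7)}
    try:
        import torch
    except ImportError:
        report["torch>=1.7"] = False
        return report
    report["torch>=1.7"] = meets_minimum(torch.__version__, "1.7")
    cuda = torch.version.cuda  # None on CPU-only builds
    report["cuda>=10.2"] = cuda is not None and meets_minimum(cuda, "10.2")
    return report


print(check_environment())
```

Versions are compared as integer tuples rather than strings because a plain string comparison would rank "1.13" below "1.7".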

Dataset

  • ImageNet-1K
  • COCO2017
  • ADE20K

Get Started

Prepare the environment:

conda create --name itpn python=3.8 -y
conda activate itpn

git clone git@github.com:sunsmarterjie/iTPN.git
cd iTPN

pip install torch==1.7.1 torchvision==0.8.2 timm==0.3.2 tensorboard einops

iTPN supports pre-training with either pixel or CLIP supervision. For the latter, first download the CLIP models (we use the CLIP-B/16 and CLIP-L/14 models in the paper).

Main Results

Table 1: Top-1 classification accuracy (%) by fine-tuning the pre-trained models on ImageNet-1K. We compare models of different levels and supervisions (e.g., with and without CLIP) separately.
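For reference, top-1 accuracy is the percentage of images whose highest-scoring class matches the ground-truth label. The sketch below shows that computation on toy data; `top1_accuracy` is a hypothetical helper and the scores and labels are made up for illustration, not results from the paper.

```python
def top1_accuracy(logits, labels):
    """Percentage of samples whose arg-max class equals the label."""
    if len(logits) != len(labels):
        raise ValueError("logits and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for scores, label in zip(logits, labels):
        # Index of the largest score is the predicted class.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy example: four samples, three classes (illustrative values only).
logits = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 0.9],
    [0.4, 0.3, 0.2],
]
labels = [1, 0, 2, 1]
print(top1_accuracy(logits, labels))  # 3 of 4 correct -> 75.0
```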

Table 2: Visual recognition results (%) on COCO and ADE20K. Mask R-CNN (abbr. MR, 1x/3x) and Cascade Mask R-CNN (abbr. CMR, 1x) are used on COCO, and UPerHead with 512x512 input is used on ADE20K. For the base-level models, each cell of COCO results contains object detection (box) and instance segmentation (mask) APs. For the large-level models, the accuracy of 1x Mask R-CNN surpasses all existing methods.

License

iTPN is released under the License.

Your star is my motivation to update, thanks!

Citation

@inproceedings{tian2023integrally,
  title={Integrally Pre-Trained Transformer Pyramid Networks},
  author={Tian, Yunjie and Xie, Lingxi and Wang, Zhaozhi and Wei, Longhui and Zhang, Xiaopeng and Jiao, Jianbin and Wang, Yaowei and Tian, Qi and Ye, Qixiang},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={18610--18620},
  year={2023}
}
@inproceedings{zhang2023hivit,
  title={HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer},
  author={Zhang, Xiaosong and Tian, Yunjie and Xie, Lingxi and Huang, Wei and Dai, Qi and Ye, Qixiang and Tian, Qi},
  booktitle={International Conference on Learning Representations},
  year={2023}
}