
🚀 FKD: A Fast Knowledge Distillation Framework for Visual Recognition

Official PyTorch implementation of the paper "A Fast Knowledge Distillation Framework for Visual Recognition" (ECCV 2022; ECCV paper, arXiv) by Zhiqiang Shen and Eric Xing.

Abstract

Knowledge Distillation (KD) has been recognized as a useful tool in many visual tasks, such as supervised classification and self-supervised representation learning. The main drawback of the vanilla KD framework is that most of its computational overhead is spent on forward passes through the giant teacher network, which makes the whole learning procedure inefficient and costly.

🚀 Fast Knowledge Distillation (FKD) is a novel framework that addresses this inefficiency. It simulates the distillation training phase and generates soft labels following the multi-crop KD procedure, while training faster than other methods. When multiple crops are taken from the same image during data loading, FKD is even more efficient than the conventional classification framework. It achieves 80.1% (SGD) and 80.5% (AdamW) Top-1 accuracy using ResNet-50 on ImageNet-1K with plain training settings. This work also demonstrates the efficiency advantage of FKD on the self-supervised learning task.

Citation

@article{shen2021afast,
      title={A Fast Knowledge Distillation Framework for Visual Recognition}, 
      author={Zhiqiang Shen and Eric Xing},
      year={2021},
      journal={arXiv preprint arXiv:2112.01528}
}

What's New

Please refer to our work here if you would like to utilize mixture-based data augmentations (Mixup, CutMix, etc.) during the soft label generation and model training.
Added the soft label generation code for customization. We will also set up a soft label zoo and baselines with multiple soft labels from various teachers.
FKD with AdamW on ResNet-50 achieves 80.5% using a plain training scheme. The pre-trained model is available here.

Supervised Training

Preparation

Install PyTorch and prepare the ImageNet dataset following the official PyTorch ImageNet training code. This repo makes minimal modifications to that code.
Download our soft labels and unzip them. We provide multiple types of soft labels and recommend Marginal Smoothing Top-5 (500-crop).
[Optional] Generate customized soft labels using ./FKD_SLG.
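
To give a rough idea of what a Marginal Smoothing Top-5 label represents, the sketch below keeps the teacher's top-k probabilities and spreads the remaining probability mass evenly over all other classes. This is a minimal illustration of the marginal smoothing idea as we read it from the paper; the function name and the dictionary input are hypothetical and do not reflect the on-disk format of the released soft labels or the code in ./FKD_SLG.

```python
def expand_marginal_smoothing(topk_probs, num_classes):
    """Expand {class_index: probability} top-k entries into a full distribution.

    Assumption: the probability mass not covered by the top-k entries is
    shared equally among the remaining (num_classes - k) classes.
    """
    k = len(topk_probs)
    if k >= num_classes:
        raise ValueError("top-k must be smaller than the number of classes")
    remaining_mass = max(0.0, 1.0 - sum(topk_probs.values()))
    fill = remaining_mass / (num_classes - k)
    full = [fill] * num_classes
    for cls, prob in topk_probs.items():
        full[cls] = prob
    return full


# Hypothetical 10-class example with top-2 teacher predictions.
label = expand_marginal_smoothing({3: 0.7, 5: 0.2}, num_classes=10)
print(label[3], label[5], round(sum(label), 6))
```

Storing only the top-k entries per crop keeps the label files small, while the evenly spread remainder avoids assigning zero probability to the other classes.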

FKD Training on CNNs

To train a model, run train_FKD.py with the desired model architecture and the path to the soft label and ImageNet dataset:

python train_FKD.py -a resnet50 --lr 0.1 --num_crops 4 -b 1024 --cos --temp 1.0 --softlabel_path [soft label path] [imagenet-folder with train and val folders]

Add --mixup_cutmix to enable Mixup and CutMix augmentations. For --softlabel_path, use a path of the form ./FKD_soft_label_500_crops_marginal_smoothing_k_5/imagenet.
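
For readers new to training against soft labels, the following framework-free sketch shows a cross-entropy loss between a stored soft label and the student's temperature-scaled prediction. How train_FKD.py applies --temp internally is not described in this README, so this sketch assumes the temperature simply divides the student logits before the softmax; the function names and toy numbers are illustrative only.

```python
import math


def softmax(logits, temp=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temp for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(student_logits, soft_label, temp=1.0):
    """Cross-entropy of the student's prediction under a soft target."""
    probs = softmax(student_logits, temp)
    return -sum(t * math.log(p) for t, p in zip(soft_label, probs) if t > 0)


# Toy 3-class example: the soft label comes from a (hypothetical) teacher.
loss = soft_cross_entropy([2.0, 0.5, 0.1], [0.8, 0.15, 0.05], temp=1.0)
print(f"loss = {loss:.4f}")
```

With temp = 1.0, as in the commands above, the scaling has no effect and the loss reduces to a plain soft-target cross-entropy.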

Multi-processing distributed training on a single node with multiple GPUs:

python train_FKD.py \
--dist-url 'tcp://127.0.0.1:10001' \
--dist-backend 'nccl' \
--multiprocessing-distributed --world-size 1 --rank 0 \
-a resnet50 --lr 0.1 --num_crops 4 -b 1024 \
--temp 1.0 --cos -j 32 \
--save_checkpoint_path ./FKD_nc_4_res50_plain \
--softlabel_path [soft label path, e.g., ./FKD_soft_label_500_crops_marginal_smoothing_k_5/imagenet] \
[imagenet-folder with train and val folders]

For multi-node, multi-processing distributed training, please refer to the official PyTorch ImageNet training code.
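
The --cos flag in the commands above requests a cosine learning-rate schedule. As a minimal sketch of what such a schedule typically looks like, the function below applies a half-cosine decay per epoch. This is an assumption for illustration: the exact schedule in train_FKD.py (for example, warmup or per-iteration updates) may differ, and the total of 100 epochs is a hypothetical value, not one stated in this README.

```python
import math


def cosine_lr(base_lr, epoch, total_epochs):
    """Half-cosine decay from base_lr at epoch 0 to 0 at epoch == total_epochs."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))


# Hypothetical run: base learning rate 0.1 (as in --lr 0.1), 100 epochs.
for epoch in (0, 25, 50, 75, 100):
    print(epoch, round(cosine_lr(0.1, epoch, 100), 4))
```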

Evaluation

python train_FKD.py -a resnet50 -e --resume [model path] [imagenet-folder with train and val folders]
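
The tables in this README report Top-1 accuracy. As a reminder of what that metric measures, here is a minimal, framework-free sketch; the function name and the toy scores are illustrative and are not taken from train_FKD.py.

```python
def topk_accuracy(scores, targets, k=1):
    """Percentage of samples whose true class is among the k highest scores."""
    correct = 0
    for row, target in zip(scores, targets):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        correct += target in ranked[:k]
    return 100.0 * correct / len(targets)


# Toy example: 3 samples, 3 classes.
scores = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2], [0.2, 0.3, 0.5]]
targets = [1, 1, 2]
print(f"Top-1: {topk_accuracy(scores, targets, k=1):.2f}%")
print(f"Top-2: {topk_accuracy(scores, targets, k=2):.2f}%")
```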

Training Speed Comparison

The per-epoch training time was measured on the HPC/CIAI cluster at MBZUAI with 8 NVIDIA V100 GPUs. The batch size is 1024 for all three methods: (i) the regular/vanilla classification framework, (ii) ReLabel, and (iii) FKD. For Vanilla and ReLabel, we report the average over 10 epochs after the speed has stabilized. For FKD, we set num_crops = 4 and average over 4 × 10 epochs; note that num_crops = 8 gives faster training. All other settings are identical across methods.

Method	Network	Training time per-epoch
Vanilla	ResNet-50	579.36 sec/epoch
ReLabel	ResNet-50	762.11 sec/epoch
FKD (Ours)	ResNet-50	486.77 sec/epoch
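
To make the comparison easier to read, the short script below turns the per-epoch times from the table into relative speedups. It only restates the numbers above; the helper name is illustrative.

```python
# Per-epoch training times (seconds) copied from the table above.
seconds_per_epoch = {"Vanilla": 579.36, "ReLabel": 762.11, "FKD": 486.77}


def speedup_over(baseline, method="FKD", timings=seconds_per_epoch):
    """How many times faster `method` is than `baseline` per epoch."""
    return timings[baseline] / timings[method]


for name in ("Vanilla", "ReLabel"):
    print(f"FKD vs {name}: {speedup_over(name):.2f}x")
# FKD vs Vanilla: 1.19x
# FKD vs ReLabel: 1.57x
```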

Trained Models

Method	Network	accuracy (Top-1)	weights	configurations
`ReLabel`	ResNet-50	78.9	--	--
`FKD`	ResNet-50	80.1 (+1.2%)	link	same as ReLabel, with initial lr = 0.1 × batch size / 512
`FKD` (Plain)	ResNet-50	79.8	link	Table 12 in paper (w/o warmup and color jitter)
`FKD` (AdamW)	ResNet-50	80.5	link	Table 13 in paper (same as our settings on ViT and SReT)
`ReLabel`	ResNet-101	80.7	--	--
`FKD`	ResNet-101	81.9 (+1.2%)	link	Table 12 in paper
`FKD` (Plain)	ResNet-101	81.7	link	Table 12 in paper (w/o warmup and color jitter)
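
The learning-rate rule in the configurations column (initial lr = 0.1 × batch size / 512) can be written as a one-line helper. This is only a restatement of the table entry; the function name is hypothetical, and which table row a given command line corresponds to is not stated here, so treat the printed values as illustrative.

```python
def scaled_initial_lr(batch_size, base_lr=0.1, base_batch=512):
    """Linear scaling rule from the table: lr = base_lr * batch_size / base_batch."""
    return base_lr * batch_size / base_batch


for batch_size in (256, 512, 1024):
    print(batch_size, scaled_initial_lr(batch_size))
```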

Mobile-level Efficient Networks

Method	Network	FLOPs	accuracy (Top-1)	weights
`FBNet`	FBNet-c100	375M	75.12%	--
`FKD`	FBNet-c100	375M	77.13% (+2.01%)	link
`EfficientNetv2`	EfficientNetv2-B0	700M	78.35%	--
`FKD`	EfficientNetv2-B0	700M	79.94% (+1.59%)	link

The training protocol is the same as the one we use for ViT/SReT:

# Use the same settings as on ViT and SReT
cd train_ViT
# Train the model
python -u train_ViT_FKD.py \
--dist-url 'tcp://127.0.0.1:10001' \
--dist-backend 'nccl' \
--multiprocessing-distributed --world-size 1 --rank 0 \
-a tf_efficientnetv2_b0 \
--lr 0.002 --wd 0.05 \
--epochs 300 --cos -j 32 \
--num_classes 1000 --temp 1.0 \
-b 1024 --num_crops 4 \
--save_checkpoint_path ./FKD_nc_4_224_efficientnetv2_b0 \
--soft_label_type marginal_smoothing_k5  \
--softlabel_path [soft label path] \
[imagenet-folder with train and val folders]

FKD Training on ViT/DeiT and SReT

To train a ViT model, run train_ViT_FKD.py with the desired model architecture and the path to the soft label and ImageNet dataset:

cd train_ViT
python train_ViT_FKD.py \
--dist-url 'tcp://127.0.0.1:10001' \
--dist-backend 'nccl' \
--multiprocessing-distributed --world-size 1 --rank 0 \
-a SReT_LT --lr 0.002 --wd 0.05 --num_crops 4 \
--temp 1.0 -b 1024 --cos \
--softlabel_path [soft label path] \
[imagenet-folder with train and val folders]

For instructions on the SReT_LT model, please refer to SReT.

Evaluation

python train_ViT_FKD.py -a SReT_LT -e --resume [model path] [imagenet-folder with train and val folders]

Trained Models

Model	FLOPs	#params	accuracy (Top-1)	weights	configurations
`DeiT-T-distill`	1.3B	5.7M	74.5	--	--
`FKD ViT/DeiT-T`	1.3B	5.7M	75.2	link	Table 13 in paper
`SReT-LT-distill`	1.2B	5.0M	77.7	--	--
`FKD SReT-LT`	1.2B	5.0M	78.7	link	Table 13 in paper

Fast MEAL V2

Please see MEAL V2 for instructions on running FKD with MEAL V2.

Self-supervised Representation Learning Using FKD

Please see FKD-SSL for instructions on running FKD for the SSL task.

Contact

Zhiqiang Shen (zhiqiangshen0214 at gmail.com or zhiqians at andrew.cmu.edu)
