DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers

Xianing Chen, Qiong Cao, Yujie Zhong, Jing Zhang, Shenghua Gao, Dacheng Tao; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 12052-12062

Abstract


Transformers have been successfully applied to computer vision thanks to their powerful modelling capacity with self-attention. However, the good performance of transformers heavily depends on enormous amounts of training images, so a data-efficient transformer solution is urgently needed. In this work, we propose an early knowledge distillation framework, termed DearKD, to improve the data efficiency of transformers. DearKD is a two-stage framework that first distills the inductive biases from the early intermediate layers of a CNN and then gives the transformer full play by training without distillation. Furthermore, DearKD can be applied to the extreme data-free case where no real images are available; for this setting, we propose a boundary-preserving intra-divergence loss based on DeepInversion to further close the performance gap against the full-data counterpart. Extensive experiments on ImageNet, partial ImageNet, the data-free setting and other downstream tasks demonstrate the superiority of DearKD over its baselines and state-of-the-art methods.
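To make the two-stage schedule concrete, below is a minimal sketch of the kind of training loop the abstract describes: stage 1 adds a distillation term that aligns an early intermediate CNN feature with an early transformer block, and stage 2 drops distillation and trains on the task loss alone. The teacher/student pair (ResNet-50 / torchvision's ViT-B/16), the choice of layers, the pooled-feature alignment, and all hyperparameters are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
teacher = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).to(device).eval()
student = models.vit_b_16().to(device)  # illustrative student; the paper uses its own ViT variants

feats = {}
def grab(name):  # forward hook that stashes an intermediate activation
    def hook(_mod, _inp, out):
        feats[name] = out
    return hook

teacher.layer1.register_forward_hook(grab("t_early"))             # (B, 256, 56, 56)
student.encoder.layers[2].register_forward_hook(grab("s_early"))  # (B, 197, 768)

proj = nn.Linear(768, 256).to(device)  # map student tokens onto teacher channels
opt = torch.optim.AdamW(list(student.parameters()) + list(proj.parameters()), lr=5e-4)

def early_distill_loss():
    """Align globally pooled early features: a crude stand-in for the
    paper's inductive-bias distillation on early intermediate layers."""
    t = feats["t_early"].mean(dim=(2, 3))   # pooled CNN feature, (B, 256)
    s = proj(feats["s_early"].mean(dim=1))  # pooled, projected tokens, (B, 256)
    return F.mse_loss(s, t.detach())

def train_step(images, labels, stage):
    logits = student(images)  # also fills feats["s_early"] via the hook
    loss = F.cross_entropy(logits, labels)
    if stage == 1:  # stage 1 only: add the early distillation term
        with torch.no_grad():
            teacher(images)  # fills feats["t_early"]
        loss = loss + early_distill_loss()  # distillation weight of 1.0 is an assumption
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

In this sketch one would run `train_step(..., stage=1)` for the early epochs and then switch to `stage=2`, matching the abstract's "first distills ... then trains without distillation" schedule; the stage-switch epoch is a hyperparameter the abstract does not specify.

For the data-free case, the following is a hedged sketch of DeepInversion-style image synthesis with an illustrative intra-divergence term. The batch-norm statistic matching follows the published DeepInversion recipe; the `intra_div` term is only one plausible reading of the abstract's "boundary-preserving intra-divergence loss" (spread same-class samples apart in feature space while cross-entropy keeps them inside the class boundary), not the authors' exact formulation, and the loss weights are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

device = "cuda" if torch.cuda.is_available() else "cpu"
teacher = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).to(device).eval()
for p in teacher.parameters():
    p.requires_grad_(False)  # only the synthesized images are optimized

bn_losses, feats = [], {}
for m in teacher.modules():  # match each BN layer's batch statistics (DeepInversion prior)
    if isinstance(m, nn.BatchNorm2d):
        def bn_hook(mod, inp, _out):
            x = inp[0]
            bn_losses.append(
                F.mse_loss(x.mean(dim=(0, 2, 3)), mod.running_mean)
                + F.mse_loss(x.var(dim=(0, 2, 3), unbiased=False), mod.running_var))
        m.register_forward_hook(bn_hook)
teacher.avgpool.register_forward_hook(
    lambda _m, _i, out: feats.__setitem__("pen", out.flatten(1)))  # penultimate embedding

batch, cls = 16, 207  # synthesize one class to illustrate intra-divergence
images = torch.randn(batch, 3, 224, 224, device=device, requires_grad=True)
targets = torch.full((batch,), cls, device=device)
opt = torch.optim.Adam([images], lr=0.05)

for step in range(2000):
    bn_losses.clear()
    logits = teacher(images)
    ce = F.cross_entropy(logits, targets)  # keeps samples inside the class boundary
    bn = torch.stack(bn_losses).sum()      # DeepInversion feature-statistic prior
    # illustrative intra-divergence: maximize mean pairwise distance of
    # same-class embeddings so the synthesized batch stays diverse
    intra_div = -torch.pdist(F.normalize(feats["pen"], dim=1)).mean()
    loss = ce + 0.01 * bn + 0.1 * intra_div
    opt.zero_grad()
    loss.backward()
    opt.step()
```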

Related Material


BibTeX:
@InProceedings{Chen_2022_CVPR,
  author    = {Chen, Xianing and Cao, Qiong and Zhong, Yujie and Zhang, Jing and Gao, Shenghua and Tao, Dacheng},
  title     = {DearKD: Data-Efficient Early Knowledge Distillation for Vision Transformers},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2022},
  pages     = {12052-12062}
}