Asymmetric Masked Distillation for Pre-Training Small Foundation Models

Zhiyu Zhao, Bingkun Huang, Sen Xing, Gangshan Wu, Yu Qiao, Limin Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 18516-18526

Abstract


Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models; however, large foundation models often incur high computational cost. This paper focuses on pre-training relatively small vision transformer models that can be efficiently adapted to downstream tasks. Specifically, taking inspiration from knowledge distillation in model compression, we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is an asymmetric masking strategy, where the teacher model sees more context information with a lower masking ratio while the student model still uses a high masking ratio. We design customized multi-layer feature alignment between the teacher encoder and student encoder to regularize the pre-training of the student MAE. To demonstrate the effectiveness and versatility of AMD, we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieves 84.6% classification accuracy on IN1K with the ViT-B model, and 73.3% classification accuracy with the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE. We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvement over the original masked autoencoding. The code and models are available at https://github.com/MCG-NJU/AMD.
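
The following is a minimal PyTorch-style sketch of the asymmetric masking idea described above, not the authors' implementation (see the linked repository for that). The encoder modules, the projection head `proj`, the specific masking ratios, and the single-layer smooth-L1 alignment loss are illustrative assumptions; the paper itself uses customized multi-layer feature alignment between the teacher and student encoders.

```python
import torch
import torch.nn.functional as F


def random_keep_indices(num_tokens: int, mask_ratio: float, device) -> torch.Tensor:
    """Indices of tokens kept visible under a random mask with the given ratio."""
    num_keep = int(num_tokens * (1.0 - mask_ratio))
    return torch.randperm(num_tokens, device=device)[:num_keep]


def asymmetric_distillation_loss(tokens, teacher_enc, student_enc, proj,
                                 teacher_ratio=0.5, student_ratio=0.9):
    """One asymmetric masked distillation step on a batch of patch tokens.

    tokens:      (B, N, C) patch embeddings shared by teacher and student.
    teacher_enc: frozen teacher encoder operating on visible tokens.
    student_enc: student encoder operating on visible tokens.
    proj:        module mapping the student feature dim to the teacher feature dim.
    Ratios are illustrative assumptions, not the paper's settings.
    """
    B, N, _ = tokens.shape
    device = tokens.device

    # Student keeps few tokens (high masking ratio); teacher keeps a superset
    # of the student's visible tokens plus extra ones (lower masking ratio).
    student_keep = random_keep_indices(N, student_ratio, device)
    kept = set(student_keep.tolist())
    remaining = torch.tensor(
        [i for i in torch.randperm(N, device=device).tolist() if i not in kept],
        dtype=torch.long, device=device)
    num_teacher_keep = int(N * (1.0 - teacher_ratio))
    num_extra = max(0, num_teacher_keep - student_keep.numel())
    teacher_keep = torch.cat([student_keep, remaining[:num_extra]])

    with torch.no_grad():                                   # teacher is not updated
        t_feat = teacher_enc(tokens[:, teacher_keep])       # (B, N_t, C_t)
    s_feat = proj(student_enc(tokens[:, student_keep]))     # (B, N_s, C_t)

    # Align student features with the teacher features of the shared visible
    # tokens (the first N_s teacher tokens by construction above).
    return F.smooth_l1_loss(s_feat, t_feat[:, : s_feat.shape[1]])
```

In the full framework, this alignment term regularizes the student's masked-autoencoding pre-training rather than replacing the reconstruction objective.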

Related Material


@InProceedings{Zhao_2024_CVPR,
    author    = {Zhao, Zhiyu and Huang, Bingkun and Xing, Sen and Wu, Gangshan and Qiao, Yu and Wang, Limin},
    title     = {Asymmetric Masked Distillation for Pre-Training Small Foundation Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {18516-18526}
}