[pdf] [supp] [bibtex]

@InProceedings{Zhou_2023_ICCV,
  author    = {Zhou, Aojun and Li, Yang and Qin, Zipeng and Liu, Jianbo and Pan, Junting and Zhang, Renrui and Zhao, Rui and Gao, Peng and Li, Hongsheng},
  title     = {SparseMAE: Sparse Training Meets Masked Autoencoders},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2023},
  pages     = {16176-16186}
}
SparseMAE: Sparse Training Meets Masked Autoencoders
Abstract
Masked Autoencoders (MAE) and their variants have proven effective for pretraining large-scale Vision Transformers (ViTs). However, small-scale models do not benefit from such pretraining due to their limited capacity. Sparse training transfers representations from large models to small ones by pruning unimportant parameters. However, naively combining MAE finetuning with sparse training makes the network task-specific, resulting in the loss of task-agnostic knowledge, which is crucial for model generalization. In this paper, we aim to reduce the complexity of large vision transformers pretrained by MAE with the assistance of sparse training. We summarize various sparse training methods for pruning large vision transformers during the MAE pretraining and finetuning stages, and discuss their shortcomings. To better learn both task-agnostic and task-specific knowledge, we propose SparseMAE, a novel two-stage sparse training method that consists of sparse pretraining and sparse finetuning. In sparse pretraining, we dynamically prune a small-scale sub-network from a ViT-Base. During finetuning, the sparse sub-network adaptively changes its topology connections under the guidance of the task-agnostic knowledge of the full model. Extensive experimental results demonstrate the effectiveness of our method and its superiority on small-scale vision transformers. Code will be available at https://github.com/aojunzz/SparseMAE.
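The abstract only sketches the mechanism at a high level; the scoring function, sparsity schedule, and regrowth rule are detailed in the paper. As a rough illustration of what "dynamically pruning a sub-network whose topology can change" looks like in general, the following is a minimal PyTorch sketch of magnitude-based pruning with random regrowth over a binary weight mask (in the spirit of dynamic sparse training methods such as SET/RigL). All names (`update_mask`, `apply_masks`, `regrow_frac`) are hypothetical and this is not the SparseMAE implementation.

```python
# Minimal sketch of dynamic sparse training with a binary weight mask
# (magnitude pruning + random regrowth). Hypothetical illustration only,
# not the SparseMAE code.
import torch
import torch.nn as nn


def update_mask(weight: torch.Tensor, sparsity: float, regrow_frac: float = 0.1) -> torch.Tensor:
    """Return a binary mask keeping the largest-magnitude weights, then
    regrow a fraction of the pruned connections so the sub-network
    topology can drift between mask updates."""
    numel = weight.numel()
    n_keep = max(1, int(numel * (1.0 - sparsity)))

    # 1) Prune: keep the top-n_keep weights by magnitude.
    scores = weight.abs().flatten()
    topk = torch.topk(scores, n_keep).indices
    mask = torch.zeros(numel, dtype=torch.bool, device=weight.device)
    mask[topk] = True

    # 2) Regrow: re-activate a few currently pruned positions at random.
    #    (Real dynamic-sparse methods often re-initialize regrown weights
    #    to zero and let training decide whether to keep them.)
    pruned = (~mask).nonzero(as_tuple=False).flatten()
    n_regrow = min(int(regrow_frac * n_keep), pruned.numel())
    if n_regrow > 0:
        chosen = pruned[torch.randperm(pruned.numel(), device=weight.device)[:n_regrow]]
        mask[chosen] = True

    return mask.view_as(weight)


def apply_masks(model: nn.Module, sparsity: float) -> None:
    """Recompute masks for every linear layer and zero out pruned weights.
    In a real training loop this would run every few hundred iterations."""
    with torch.no_grad():
        for module in model.modules():
            if isinstance(module, nn.Linear):
                mask = update_mask(module.weight, sparsity)
                module.weight.mul_(mask.to(module.weight.dtype))


if __name__ == "__main__":
    # Toy stand-in for one ViT block's MLP; a real setup would mask the
    # attention and MLP projections of a ViT-Base pretrained with MAE.
    mlp = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))
    apply_masks(mlp, sparsity=0.9)
    kept = sum((m.weight != 0).sum().item() for m in mlp if isinstance(m, nn.Linear))
    total = sum(m.weight.numel() for m in mlp if isinstance(m, nn.Linear))
    print(f"kept {kept}/{total} weights ({kept / total:.1%})")
```

The sketch uses weight magnitude as the pruning score purely for concreteness; the paper's two-stage scheme additionally couples the sub-network to the full model's task-agnostic knowledge during finetuning, which this toy example does not attempt to reproduce.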
Related Material