@InProceedings{Wei_2024_CVPR,
  author    = {Wei, Zihao and Wei, Chen and Mei, Jieru and Bai, Yutong and Wang, Zeyu and Li, Xianhang and Zhu, Hongru and Wang, Huiyu and Yuille, Alan and Zhou, Yuyin and Xie, Cihang},
  title     = {Masked Autoencoders are Secretly Efficient Learners},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {7986-7995}
}
Masked Autoencoders are Secretly Efficient Learners
Abstract
This paper provides an efficiency study of training Masked Autoencoders (MAE), a framework introduced by He et al. for pre-training Vision Transformers (ViTs). Our results surprisingly reveal that MAE can learn at a faster speed and with fewer training samples while maintaining high performance. To accelerate its training, our changes are simple and straightforward: in the pre-training stage, we aggressively increase the masking ratio, decrease the number of training epochs, and reduce the decoder depth to lower the pre-training cost; in the fine-tuning stage, we demonstrate that layer-wise learning rate decay plays a vital role in unlocking the full potential of pre-trained models. Under this setup, we further verify the sample efficiency of MAE: training MAE is hardly affected even when using only 20% of the original training set. By combining these strategies, we are able to accelerate MAE pre-training by a factor of 82 or more with little performance drop. For example, we are able to pre-train a ViT-B in 9 hours using a single NVIDIA A100 GPU and achieve 82.9% top-1 accuracy on the downstream ImageNet classification task. Additionally, we also verify the speed acceleration on another MAE extension, SupMAE.
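To make the two ingredients concrete, below is a minimal, illustrative Python sketch of (1) an efficient pre-training configuration and (2) layer-wise learning rate decay for fine-tuning a ViT-B. The specific values (masking ratio, decoder depth, epoch count, decay factor) are placeholders chosen for illustration, not the paper's reported settings; the layer grouping assumes a standard 12-block ViT-B.

```python
# Illustrative sketch only; hyperparameter values are assumptions, not the
# paper's exact numbers.

def layer_wise_lr_scales(num_blocks: int = 12, decay: float = 0.75) -> list[float]:
    """Per-group learning-rate multipliers for layer-wise LR decay.

    Group 0 is the patch embedding, groups 1..num_blocks are the transformer
    blocks, and the last group is the classification head. Earlier layers
    receive smaller multipliers, so pre-trained low-level features change less
    during fine-tuning.
    """
    num_groups = num_blocks + 2
    return [decay ** (num_groups - 1 - i) for i in range(num_groups)]


# Efficient pre-training knobs described in the abstract (values illustrative):
pretrain_cfg = {
    "mask_ratio": 0.85,   # more aggressive than MAE's default 0.75
    "decoder_depth": 1,   # shallower than MAE's default 8-block decoder
    "epochs": 100,        # far fewer than the usual 800-1600 epochs
}

if __name__ == "__main__":
    base_lr = 1e-3
    print([round(base_lr * s, 6) for s in layer_wise_lr_scales()])
```

In practice, each multiplier would be attached to its parameter group when building the optimizer, so the head trains at the full base learning rate while the patch embedding trains at roughly `base_lr * decay**13` under this grouping.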