SimMIM: A Simple Framework for Masked Image Modeling

Xie, Zhenda; Zhang, Zheng; Cao, Yue; Lin, Yutong; Bao, Jianmin; Yao, Zhuliang; Dai, Qi; Hu, Han

SimMIM: A Simple Framework for Masked Image Modeling

Zhenda Xie, Zheng Zhang, Yue Cao, Yutong Lin, Jianmin Bao, Zhuliang Yao, Qi Dai, Han Hu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 9653-9663

Abstract

This paper presents SimMIM, a simple framework for masked image modeling. We have simplified recently proposed relevant approaches, without the need for special designs, such as block-wise masking and tokenization via discrete VAE or clustering. To investigate what makes a masked image modeling task learn good representations, we systematically study the major components in our framework, and find that the simple designs of each component have revealed very strong representation learning performance: 1) random masking of the input image with a moderately large masked patch size (e.g., 32) makes a powerful pre-text task; 2) predicting RGB values of raw pixels by direct regression performs no worse than the patch classification approaches with complex designs; 3) the prediction head can be as light as a linear layer, with no worse performance than heavier ones. Using ViT-B, our approach achieves 83.8% top-1 fine-tuning accuracy on ImageNet-1K by pre-training also on this dataset, surpassing previous best approach by +0.6%. When applied to a larger model with about 650 million parameters, SwinV2-H, it achieves 87.1% top-1 accuracy on ImageNet-1K using only ImageNet-1K data. We also leverage this approach to address the data-hungry issue faced by large-scale model training, that a 3B model (SwinV2-G) is successfully trained to achieve state-of-the-art accuracy on four representative vision benchmarks using 40x less labeled data than that in previous practice (JFT-3B). The code is available at https://github.com/microsoft/SimMIM.

Related Material

[pdf] [supp] [arXiv]

[bibtex]

@InProceedings{Xie_2022_CVPR, author = {Xie, Zhenda and Zhang, Zheng and Cao, Yue and Lin, Yutong and Bao, Jianmin and Yao, Zhuliang and Dai, Qi and Hu, Han}, title = {SimMIM: A Simple Framework for Masked Image Modeling}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2022}, pages = {9653-9663} }