Improving Pixel-based MIM by Reducing Wasted Modeling Capability

Yuan Liu, Songyang Zhang, Jiacheng Chen, Zhaohui Yu, Kai Chen, Dahua Lin; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 5361-5372

Abstract


There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation.
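The abstract's core idea, fusing low-level features from shallow encoder layers into the representation used for pixel reconstruction, can be illustrated with a toy sketch. The code below is not the authors' implementation; the encoder is a stand-in for a ViT, and the choice of fused layers and the uniform fusion weights are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a ViT encoder: each "layer" maps token features
# to new token features (a real block would use attention + MLP).
num_layers, num_tokens, dim = 12, 49, 64
x = rng.standard_normal((num_tokens, dim))
layer_weights = [rng.standard_normal((dim, dim)) * 0.05
                 for _ in range(num_layers)]

features = []  # hidden states from every layer
h = x
for w in layer_weights:
    h = np.tanh(h @ w)  # heavily simplified transformer block
    features.append(h)

# Multi-level feature fusion (hypothetical configuration): a weighted
# sum of shallow and deep hidden states, so low-level detail reaches
# the pixel decoder instead of only the last layer's output.
fuse_layers = [2, 5, 8, 11]                            # assumed layer indices
alphas = np.ones(len(fuse_layers)) / len(fuse_layers)  # assumed uniform weights
fused = sum(a * features[i] for a, i in zip(alphas, fuse_layers))

print(fused.shape)  # (49, 64): same token grid, ready for the decoder
```

In a trainable version, the per-layer fusion weights would typically be learned parameters rather than the fixed uniform weights used here.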

Related Material


BibTeX
@InProceedings{Liu_2023_ICCV,
    author    = {Liu, Yuan and Zhang, Songyang and Chen, Jiacheng and Yu, Zhaohui and Chen, Kai and Lin, Dahua},
    title     = {Improving Pixel-based MIM by Reducing Wasted Modeling Capability},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {5361-5372}
}