i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?

Kevin Zhang, Zhiqiang Shen; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 7740-7749

Abstract


Masked image modeling (MIM) has been recognized as a strong self-supervised pre-training approach in the vision domain. However, the mechanism and properties of the representations learned by such a scheme, as well as how to further enhance them, are so far not well explored. In this paper, we propose an interactive Masked Autoencoders (i-MAE) framework to enhance the representation capability from two aspects: (1) employing a two-way image reconstruction and a latent feature reconstruction with a distillation loss to learn better features; (2) proposing a semantics-enhanced sampling strategy to boost the learned semantics in MAE. Upon the proposed i-MAE architecture, we address two critical questions about the behavior of the learned representations in MAE: (1) Is the separability of latent representations in Masked Autoencoders helpful for model performance? We study this by forcing the input to be a mixture of two images instead of one. (2) Can we enhance the representations in the latent feature space by controlling the degree of semantics during sampling in Masked Autoencoders? To this end, we propose a sampling strategy within a mini-batch based on the semantics of the training samples to examine this aspect. Extensive experiments are conducted on the CIFAR-10/100, Tiny-ImageNet, and ImageNet-1K datasets to verify our observations. Furthermore, in addition to qualitatively analyzing the characteristics of the latent representations, we examine the existence of linear separability and the degree of semantics in the latent space by proposing two evaluation schemes. The surprising and consistent results across the qualitative and quantitative experiments demonstrate that i-MAE is a superior framework both for understanding MAE and for achieving stronger representational ability.
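
To make the two-way reconstruction idea concrete, the sketch below shows one plausible form of the training objective described above: the encoder receives a linear mixture of two images, the latent is split into two branches, and the loss combines per-image reconstruction with a latent-feature distillation term. The module interfaces (`encoder`, `decoder`, `splitter`, `teacher_encoder`), the mixing ratio `alpha`, and the loss weighting are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def i_mae_loss(encoder, decoder, splitter, teacher_encoder,
               img_a, img_b, alpha=0.5, distill_weight=1.0):
    """One training step on a two-image mixture (hypothetical interface).

    encoder/decoder: masked-autoencoder backbone (assumed interface).
    splitter: small heads separating the mixed latent into two branches.
    teacher_encoder: frozen vanilla MAE encoder used as distillation target.
    """
    # Mix the two inputs; alpha controls the degree of mixture.
    mixed = alpha * img_a + (1.0 - alpha) * img_b

    # Encode the mixture, then split the latent into two branches.
    latent = encoder(mixed)
    latent_a, latent_b = splitter(latent)

    # Two-way image reconstruction: each branch recovers its own image.
    rec_loss = (F.mse_loss(decoder(latent_a), img_a)
                + F.mse_loss(decoder(latent_b), img_b))

    # Latent feature reconstruction with distillation: each branch is
    # pulled toward the teacher's embedding of the unmixed image.
    with torch.no_grad():
        target_a = teacher_encoder(img_a)
        target_b = teacher_encoder(img_b)
    distill_loss = (F.mse_loss(latent_a, target_a)
                    + F.mse_loss(latent_b, target_b))

    return rec_loss + distill_weight * distill_loss
```

Forcing one latent to explain two images is what makes the linear-separability question testable: if the branches can be cleanly split, the representation is, in that operational sense, separable.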
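
The semantics-enhanced sampling strategy can likewise be sketched as a mini-batch pairing rule: each sample's mixture partner is drawn from the same class with some probability, so a single ratio controls the degree of shared semantics in the mixture. The function name, and the use of class labels as a proxy for sample semantics, are assumptions made for illustration; the paper's actual criterion may differ.

```python
import random
from collections import defaultdict

def sample_mixture_pairs(images, labels, same_class_ratio=0.5):
    """Pair each batch element with a mixing partner (hypothetical helper).

    same_class_ratio: fraction of pairs whose partner shares the same
    label; higher values inject more shared semantics into the mixture.
    """
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    pairs = []
    for idx, y in enumerate(labels):
        same_pool = [j for j in by_class[y] if j != idx]
        if same_pool and random.random() < same_class_ratio:
            partner = random.choice(same_pool)   # same-class partner
        else:
            other_pool = [j for j in range(len(labels)) if labels[j] != y]
            partner = random.choice(other_pool) if other_pool else idx
        pairs.append((images[idx], images[partner]))
    return pairs
```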

Related Material


BibTeX

@InProceedings{Zhang_2024_CVPR,
  author    = {Zhang, Kevin and Shen, Zhiqiang},
  title     = {i-MAE: Are Latent Representations in Masked Autoencoders Linearly Separable?},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {7740-7749}
}