Attention-Guided Masked Autoencoders for Learning Image Representations

Leon Sick, Dominik Engel, Pedro Hermosilla, Timo Ropinski; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 836-846

Abstract


Masked autoencoders (MAEs) have established themselves as a powerful pre-training method for computer vision tasks. While vanilla MAEs put equal emphasis on reconstructing the individual parts of the image, we propose to inform the reconstruction process through an attention-guided loss function. By leveraging advances in unsupervised object discovery, we obtain an attention map of the scene, which we employ in the loss function to put increased emphasis on reconstructing relevant objects. Thus, we incentivize the model to learn improved representations of the scene for a variety of tasks. Our evaluations show that our pre-trained models produce off-the-shelf representations that are more effective than the vanilla MAE for such tasks, demonstrated by improved linear probing and k-NN classification results on several benchmarks, while at the same time making ViTs more robust against varying backgrounds and changes in texture.
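The core idea of reweighting the MAE reconstruction loss with an attention map can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, array shapes, and the normalization of the attention weights are assumptions for the example.

```python
import numpy as np

def attention_weighted_mae_loss(pred, target, attn, mask):
    """Per-patch MSE reconstruction loss, reweighted by an attention map.

    pred, target: (num_patches, patch_dim) reconstructed / original patches
    attn: (num_patches,) attention scores, e.g. from unsupervised object discovery
    mask: (num_patches,) boolean, True where the patch was masked out
    """
    # Per-patch mean squared error, as in the vanilla MAE objective
    per_patch = ((pred - target) ** 2).mean(axis=1)
    # Normalize attention so the weights average to 1 over masked patches;
    # patches on salient objects then contribute more to the loss
    weights = attn / attn[mask].mean()
    # Average the weighted loss over masked patches only
    return (weights * per_patch)[mask].mean()
```

With a uniform attention map this reduces to the vanilla MAE loss, so the weighting only changes behavior where the object-discovery attention is non-uniform.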

Related Material


@InProceedings{Sick_2025_WACV,
  author    = {Sick, Leon and Engel, Dominik and Hermosilla, Pedro and Ropinski, Timo},
  title     = {Attention-Guided Masked Autoencoders for Learning Image Representations},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {836-846}
}