@InProceedings{Kockwelp_2025_WACV,
  author    = {Kockwelp, Jacqueline and Beckmann, Daniel and Risse, Benjamin},
  title     = {Human Gaze Improves Vision Transformers by Token Masking},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
  month     = {February},
  year      = {2025},
  pages     = {396-405}
}
Human Gaze Improves Vision Transformers by Token Masking
Abstract
Human attention plays a crucial role in visual perception and decision-making, opening new possibilities for integration with machine learning models. While Transformer models excel at modeling global relationships via self-attention, understanding the importance of specific image regions for their decision-making remains challenging. This paper investigates the intersection of human gaze and Transformer-based attention in the context of object classification tasks, focusing on how gaze-prioritized regions correspond to Transformer attention. We extend the analysis of the attention mechanism during inference by restricting the attention of pretrained Vision Transformers to regions of interest derived directly from human gaze. Our findings indicate that gaze-based token masking can not only reduce the number of tokens necessary for robust model performance but can also improve classification accuracy over using the whole image in certain configurations. Although this masking can improve model performance, we show that the two attention mechanisms exhibit clear structural differences for natural images. Our results shed light on the relationship between human and Transformer attention, providing novel perspectives for optimising Transformer models to achieve more efficient and interpretable image understanding and classification. Code is available at https://zivgitlab.uni-muenster.de/cvmls/gaze-based-token-masking.
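The core idea of gaze-based token masking can be sketched as follows: pool a human gaze saliency map onto the ViT patch grid and keep only the most-attended patch tokens before inference. This is a minimal illustrative sketch, not the authors' released implementation; the function name, the average-pooling choice, and the `keep_ratio` parameter are assumptions for demonstration.

```python
import numpy as np

def gaze_token_mask(gaze_map, patch_size=16, keep_ratio=0.5):
    """Return indices of patch tokens with the highest gaze saliency.

    gaze_map: (H, W) array of human gaze saliency, with H and W
    divisible by patch_size (hypothetical helper, for illustration).
    """
    H, W = gaze_map.shape
    gh, gw = H // patch_size, W // patch_size
    # Average-pool the pixel-level gaze map onto the ViT patch grid.
    patch_scores = gaze_map.reshape(gh, patch_size, gw, patch_size).mean(axis=(1, 3))
    scores = patch_scores.ravel()
    # Keep the top-k most gaze-attended patch tokens.
    k = max(1, int(round(keep_ratio * scores.size)))
    keep = np.sort(np.argsort(scores)[::-1][:k])
    return keep

# Example: a 224x224 gaze map yields a 14x14 = 196-token grid;
# keep_ratio=0.25 retains the 49 most-attended tokens, whose indices
# would then select the corresponding patch embeddings (plus CLS token)
# fed to the pretrained ViT.
gaze = np.zeros((224, 224))
gaze[:16, :16] = 1.0  # simulated fixation on the top-left patch
kept = gaze_token_mask(gaze, patch_size=16, keep_ratio=0.25)
```

In a full pipeline, the returned indices would be used to subset the patch-embedding sequence of a frozen, pretrained ViT, reducing the token count at inference time as described in the abstract.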