Bandit Based Attention Mechanism in Vision Transformers

Amartya Roy Chowdhury, Raghuram Bharadwaj Diddigi, Prabuchandran K J, Achyut Mani Tripathi; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 9579-9588

Abstract


Vision Transformers (ViT) have demonstrated remarkable performance on many computer vision tasks. However, their high computational cost and quadratic complexity pose challenges for deployment in resource-constrained environments. The core of Vision Transformers is the self-attention mechanism, which aggregates information from different image regions or patches. In a conventional ViT, processing involves attention to all patches, creating a substantial computational bottleneck and extended training times. We hypothesize that applying soft attention to all patches may be unnecessary and that instead focusing on relevant and significant patches (hard attention) would be sufficient. To address this, we introduce a module within the Vision Transformer that allows the attention mechanism to selectively process only the essential patches. We propose a novel bandit-based attention mechanism that leverages the idea of exploration and exploitation. Extensive experimentation across various datasets illustrates that the proposed bandit-attention-based ViT not only achieves superior performance compared to existing state-of-the-art vision transformer models but also yields greater throughput and lower computational time in both training and inference. The code is publicly available at https://github.com/aquorio15/bandit wacv
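The abstract does not specify the paper's exact bandit formulation, but the idea of trading off exploration and exploitation to select a subset of patches for hard attention can be sketched generically. The following is a minimal, hypothetical epsilon-greedy illustration (all names, the reward-update rule, and the use of attention weights as a reward signal are illustrative assumptions, not the authors' method):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bandit_patch_attention(q, k, v, rewards, top_m, epsilon, rng):
    """Hard attention over a bandit-selected subset of patches.

    q: (d,) query vector; k, v: (n, d) patch keys/values.
    rewards: (n,) running per-patch value estimates (the bandit's arms).
    top_m: number of patches to attend to; epsilon: exploration rate.
    """
    n = k.shape[0]
    if rng.random() < epsilon:
        # Explore: attend to a random subset of patches.
        idx = rng.choice(n, size=top_m, replace=False)
    else:
        # Exploit: attend to the patches with the highest estimated value.
        idx = np.argsort(rewards)[-top_m:]
    scores = k[idx] @ q / np.sqrt(q.shape[0])   # scaled dot-product scores
    w = softmax(scores)
    out = w @ v[idx]                            # aggregate only selected patches
    return out, idx, w

rng = np.random.default_rng(0)
n, d = 16, 8                                    # 16 patches, embedding dim 8
q = rng.standard_normal(d)
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
rewards = np.zeros(n)
out, idx, w = bandit_patch_attention(q, k, v, rewards, top_m=4,
                                     epsilon=0.1, rng=rng)
# One possible reward update: nudge the selected arms toward their
# attention weights (purely illustrative).
rewards[idx] += 0.1 * (w - rewards[idx])
```

Because attention is computed over only `top_m` of the `n` patches, the per-layer score computation drops from O(n) to O(top_m) per query, which is the source of the throughput gains the abstract describes.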

Related Material


@InProceedings{Chowdhury_2025_WACV,
  author    = {Chowdhury, Amartya Roy and Diddigi, Raghuram Bharadwaj and J, Prabuchandran K and Tripathi, Achyut Mani},
  title     = {Bandit Based Attention Mechanism in Vision Transformers},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {9579-9588}
}