DeCAtt: Efficient Vision Transformers With Decorrelated Attention Heads

Mayukh Bhattacharyya, Soumitri Chattopadhyay, Sayan Nag; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023, pp. 4695-4699

Abstract


The advent of Vision Transformers (ViT) has led to significant performance gains across various computer vision tasks over the last few years, surpassing the de facto standard CNN architectures. However, most prominent Vision Transformer variants are resource-intensive architectures with large parameter counts. They are known to be data-hungry and to overfit quickly on comparatively small datasets. This holds back their widespread adoption in low-resource settings and motivates the need for resource-efficient vision transformers. To this end, we introduce a regularization loss that encourages efficient utilization of model parameters by decorrelating the heads of a multi-headed attention block in a vision transformer. This forces the heads to learn distinct features rather than attend to the same ones. As our experiments show, this loss yields consistent performance improvements across a wide range of models and datasets, demonstrating its effectiveness.
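
For intuition, below is a minimal PyTorch sketch of one plausible way such a head-decorrelation penalty could be implemented: it measures the pairwise cosine similarity between the flattened outputs of the attention heads and penalizes the off-diagonal (head-to-head) similarities. The function name, tensor layout, and exact formulation are illustrative assumptions, not the paper's definition of the loss.

import torch

def head_decorrelation_loss(head_outputs: torch.Tensor) -> torch.Tensor:
    # Illustrative decorrelation penalty over attention heads.
    # head_outputs: (batch, num_heads, seq_len, head_dim), the per-head
    # outputs of a multi-headed attention block (assumed layout).
    # Returns a scalar that is zero when distinct heads produce mutually
    # orthogonal outputs and grows as the heads become redundant.
    b, h, n, d = head_outputs.shape
    # Flatten each head's output into a single vector per sample.
    flat = head_outputs.reshape(b, h, n * d)
    # Normalize so the Gram matrix holds cosine similarities.
    flat = torch.nn.functional.normalize(flat, dim=-1)
    # Pairwise head-to-head similarity: (batch, num_heads, num_heads).
    gram = torch.bmm(flat, flat.transpose(1, 2))
    # Keep only similarities between distinct heads (zero the diagonal).
    eye = torch.eye(h, device=gram.device, dtype=gram.dtype)
    off_diag = gram * (1.0 - eye)
    # Average squared off-diagonal similarity, normalized by pair count.
    return (off_diag ** 2).sum(dim=(1, 2)).mean() / (h * (h - 1))

In training, such a penalty would be added to the task loss with a small weight, e.g. total_loss = task_loss + lambda_decorr * head_decorrelation_loss(head_outputs), where lambda_decorr is a hypothetical weighting hyperparameter.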

Related Material


[bibtex]
@InProceedings{Bhattacharyya_2023_CVPR,
  author    = {Bhattacharyya, Mayukh and Chattopadhyay, Soumitri and Nag, Sayan},
  title     = {DeCAtt: Efficient Vision Transformers With Decorrelated Attention Heads},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2023},
  pages     = {4695-4699}
}