Rethinking the Self-Attention in Vision Transformers

Kyungmin Kim, Bichen Wu, Xiaoliang Dai, Peizhao Zhang, Zhicheng Yan, Peter Vajda, Seon Joo Kim; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021, pp. 3071-3075

Abstract


Self-attention is a cornerstone of transformer models. However, our analysis shows that self-attention maps in vision transformer inference are extremely sparse. When applying a sparsity constraint, our experiments on image (ImageNet-1K) and video (Kinetics-400) understanding show that we can achieve 95% sparsity on the self-attention maps while limiting the performance drop to less than 2 points. This motivates us to rethink the role of self-attention in vision transformer models.
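
To make the sparsity constraint concrete, the following is a minimal sketch of one way to sparsify a self-attention map at inference time. The abstract does not specify the mechanism, so the top-k-per-query scheme, the sparse_attention function, and the keep_ratio parameter below are illustrative assumptions rather than the authors' method; keep_ratio=0.05 corresponds to the 95% sparsity figure reported above.

import torch

def sparse_attention(q, k, v, keep_ratio=0.05):
    # Hypothetical sketch: scaled dot-product attention where only the
    # largest keep_ratio fraction of attention weights per query is kept
    # (keep_ratio=0.05 gives 95% sparsity). The paper does not specify
    # its sparsification scheme; this is one plausible instantiation.
    d = q.size(-1)
    attn = (q @ k.transpose(-2, -1) / d ** 0.5).softmax(dim=-1)  # dense map, shape (..., N, N)
    n_keep = max(1, int(keep_ratio * attn.size(-1)))             # entries kept per query row
    top = attn.topk(n_keep, dim=-1)
    sparse = torch.zeros_like(attn).scatter_(-1, top.indices, top.values)
    sparse = sparse / sparse.sum(dim=-1, keepdim=True)           # renormalize each row
    return sparse @ v

# Example: batch of 1, 12 heads, 197 tokens (ViT-B/16 at 224x224), 64-dim heads
q = k = v = torch.randn(1, 12, 197, 64)
out = sparse_attention(q, k, v)  # same shape as v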

Related Material


@InProceedings{Kim_2021_CVPR,
    author    = {Kim, Kyungmin and Wu, Bichen and Dai, Xiaoliang and Zhang, Peizhao and Yan, Zhicheng and Vajda, Peter and Kim, Seon Joo},
    title     = {Rethinking the Self-Attention in Vision Transformers},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2021},
    pages     = {3071-3075}
}