Making Vision Transformers Truly Shift-Equivariant

Renan A. Rojas-Gomez, Teck-Yian Lim, Minh N. Do, Raymond A. Yeh; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 5568-5577

Abstract


In the field of computer vision, Vision Transformers (ViTs) have emerged as a prominent deep learning architecture. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs are susceptible to small spatial shifts in the input data, i.e., they lack shift-equivariance. To address this shortcoming, we introduce novel data-adaptive designs for each of the ViT modules that break shift-equivariance, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve perfect circular shift-equivariance across four prominent ViT architectures: Swin, SwinV2, CvT, and MViTv2. Additionally, we leverage our design to further enhance consistency under standard shifts. We evaluate our adaptive ViT models on image classification and semantic segmentation tasks. Our models achieve competitive performance across three diverse datasets, showcasing perfect (100%) circular shift consistency while improving standard shift consistency.
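To make the reported metric concrete: circular shift consistency measures how often a classifier's prediction is unchanged when the input image is circularly shifted (pixels wrap around the image borders). The sketch below is not the authors' evaluation code; it is a minimal, hypothetical PyTorch illustration of how such a metric could be computed for a generic classifier (the function name and its arguments are assumptions for illustration).

    import torch

    def circular_shift_consistency(model, images, max_shift=16, trials=8):
        """Fraction of predictions left unchanged by random circular shifts.

        A perfectly circular-shift-equivariant backbone followed by global
        pooling should score 100% here, since torch.roll wraps pixels
        around the image borders rather than cropping or padding.
        """
        model.eval()
        with torch.no_grad():
            base_pred = model(images).argmax(dim=-1)
            consistent, total = 0, 0
            for _ in range(trials):
                # Sample a random 2D shift and apply it circularly (wrap-around).
                dy, dx = torch.randint(-max_shift, max_shift + 1, (2,)).tolist()
                shifted = torch.roll(images, shifts=(dy, dx), dims=(-2, -1))
                pred = model(shifted).argmax(dim=-1)
                consistent += (pred == base_pred).sum().item()
                total += base_pred.numel()
        return consistent / total

A standard (non-circular) shift consistency check differs only in how the shifted image is formed, e.g., by cropping two translated windows from a larger image instead of using torch.roll.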

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Rojas-Gomez_2024_CVPR,
    author    = {Rojas-Gomez, Renan A. and Lim, Teck-Yian and Do, Minh N. and Yeh, Raymond A.},
    title     = {Making Vision Transformers Truly Shift-Equivariant},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {5568-5577}
}