Making Vision Transformers Truly Shift-Equivariant

BibTeX:
@InProceedings{Rojas-Gomez_2024_CVPR,
    author    = {Rojas-Gomez, Renan A. and Lim, Teck-Yian and Do, Minh N. and Yeh, Raymond A.},
    title     = {Making Vision Transformers Truly Shift-Equivariant},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {5568-5577}
}
Abstract
In the field of computer vision, Vision Transformers (ViTs) have emerged as a prominent deep learning architecture. Despite being inspired by Convolutional Neural Networks (CNNs), ViTs are susceptible to small spatial shifts in the input data: they lack shift-equivariance. To address this shortcoming, we introduce novel data-adaptive designs for each of the ViT modules that break shift-equivariance, such as tokenization, self-attention, patch merging, and positional encoding. With our proposed modules, we achieve perfect circular shift-equivariance across four prominent ViT architectures: Swin, SwinV2, CvT, and MViTv2. Additionally, we leverage our design to further enhance consistency under standard shifts. We evaluate our adaptive ViT models on image classification and semantic segmentation tasks. Our models achieve competitive performance across three diverse datasets, showcasing perfect (100%) circular shift consistency while improving standard shift consistency.
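As a rough illustration of the circular shift consistency metric mentioned above (not the paper's exact evaluation protocol), one can compare a classifier's predictions on an image batch and on a circularly shifted copy, where pixels wrap around the image border. A minimal PyTorch sketch, assuming a generic classification model and an image batch of shape (N, C, H, W); the function name and shift amounts are placeholders:

import torch

def circular_shift_consistency(model, images, shift=(3, 5)):
    # Fraction of images whose predicted class is unchanged when the input
    # is circularly shifted (pixels wrap around the image border).
    # A perfectly circular shift-equivariant classifier scores 1.0 (100%).
    # `model`, `images`, and `shift` are illustrative placeholders, not the
    # paper's evaluation setup.
    model.eval()
    with torch.no_grad():
        preds = model(images).argmax(dim=-1)
        rolled = torch.roll(images, shifts=shift, dims=(-2, -1))
        preds_rolled = model(rolled).argmax(dim=-1)
    return (preds == preds_rolled).float().mean().item()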