SpectFormer: Frequency and Attention is What You Need in a Vision Transformer

@InProceedings{Patro_2025_WACV,
  author    = {Patro, Badri N. and Namboodiri, Vinay P. and Agneeswaran, Vijay S.},
  title     = {SpectFormer: Frequency and Attention is What You Need in a Vision Transformer},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {9525-9536}
}
Abstract
Vision transformers have been applied successfully to image recognition tasks. Existing designs are either based on multi-headed self-attention (ViT [12], DeiT [54]), similar to the original work on textual models, or, more recently, on spectral layers (FNet [29], GFNet [46], AFNO [15]). We hypothesize that spectral layers capture high-frequency information such as lines and edges, while attention layers capture token interactions. We investigate this hypothesis in this work and observe that mixing spectral and multi-headed attention layers indeed yields a better transformer architecture. We therefore propose the novel SpectFormer architecture for vision transformers, which uses spectral layers in the initial blocks and multi-headed attention layers in the deeper blocks. We believe the resulting representation allows the transformer to capture features appropriately, and it yields improved performance over other transformer representations. For instance, it improves top-1 accuracy on ImageNet by 2% over both GFNet-H and LiT. SpectFormer-H-S reaches 84.25% top-1 accuracy on ImageNet-1K (state of the art for the small version), and SpectFormer-H-L achieves 85.7%, the state of the art among comparable base versions of transformers. We further validate SpectFormer in other scenarios, such as transfer learning on standard datasets including CIFAR-10, CIFAR-100, Oxford-IIIT Flower, and Stanford Cars. We then investigate its use in downstream tasks such as object detection and instance segmentation on the MS-COCO dataset and observe that SpectFormer shows consistent performance comparable to the best backbones and can be further optimized and improved. The source code is available at https://github.com/badripatro/SpectFormers.
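The spectral layers mentioned above can be illustrated with a minimal sketch. Assuming a GFNet-style design (which the abstract cites as a precursor), a spectral block transforms token features to the frequency domain with an FFT, multiplies them by learnable complex filters, and transforms back. The function name `spectral_layer` and the identity-filter usage below are illustrative, not taken from the SpectFormer codebase:

```python
import numpy as np

def spectral_layer(tokens, filters):
    """Sketch of a GFNet-style spectral mixing step (hypothetical helper).

    tokens:  (num_tokens, dim) real-valued token features
    filters: (num_tokens // 2 + 1, dim) learnable complex filters
    """
    # Transform along the token axis to the frequency domain.
    freq = np.fft.rfft(tokens, axis=0)
    # Element-wise modulation by the learnable filters; in a trained model
    # this is where high-frequency content (edges, lines) gets emphasized.
    freq = freq * filters
    # Back to the token domain, preserving the original sequence length.
    return np.fft.irfft(freq, n=tokens.shape[0], axis=0)

# Toy usage: an all-ones (identity) filter should return the input unchanged.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
f = np.ones((8 // 2 + 1, 4), dtype=complex)
y = spectral_layer(x, f)
```

In the full architecture described in the abstract, blocks like this would occupy the initial layers, with standard multi-headed self-attention in the deeper layers.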