Lightweight Vision Transformer with Spatial and Channel Enhanced Self-Attention

Jiahao Zheng, Longqi Yang, Yiying Li, Ke Yang, Zhiyuan Wang, Jun Zhou; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 1492-1496

Abstract


Due to their large number of parameters and high computational complexity, Vision Transformers (ViTs) are not well suited for deployment on mobile devices. As a result, the design of efficient vision transformer models has become the focus of many studies. In this paper, we introduce a novel technique called Spatial and Channel Enhanced Self-Attention (SCSA) for lightweight vision transformers. Specifically, we apply multi-head self-attention and convolutional attention in parallel to extract global and local spatial features, respectively. A fusion module based on channel attention then effectively combines the features extracted from the global and local contexts. Based on SCSA, we introduce the Spatial and Channel enhanced Attention Transformer (SCAT). On the ImageNet-1k dataset, SCAT achieves a top-1 accuracy of 76.6% with approximately 4.9M parameters and 0.7G FLOPs, outperforming state-of-the-art Vision Transformer architectures with comparable parameter counts and FLOPs.
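
The abstract only sketches SCSA at a high level, so the following PyTorch snippet is a minimal illustration of the stated structure: a multi-head self-attention branch for global spatial features, a convolutional branch for local spatial features, and a channel-attention gate that fuses the two. The module name, kernel size, and the squeeze-and-excitation-style fusion are assumptions for illustration, not the authors' implementation.

    # Sketch of the SCSA idea from the abstract: global self-attention and
    # convolutional attention in parallel, fused by channel attention.
    # All design details below (kernel size, SE-style gate) are assumed.
    import torch
    import torch.nn as nn

    class SCSASketch(nn.Module):
        def __init__(self, dim: int, num_heads: int = 4, reduction: int = 4):
            super().__init__()
            # Global branch: standard multi-head self-attention over tokens.
            self.norm = nn.LayerNorm(dim)
            self.mhsa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Local branch: depthwise 3x3 convolution (assumed kernel size).
            self.local = nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
                nn.BatchNorm2d(dim),
                nn.GELU(),
            )
            # Fusion: squeeze-and-excitation-style channel attention that
            # yields a per-channel gate for blending the two branches.
            self.gate = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(dim, dim // reduction, kernel_size=1),
                nn.GELU(),
                nn.Conv2d(dim // reduction, dim, kernel_size=1),
                nn.Sigmoid(),
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (B, C, H, W) feature map.
            b, c, h, w = x.shape
            # Global spatial features: self-attention on flattened tokens.
            tokens = self.norm(x.flatten(2).transpose(1, 2))  # (B, HW, C)
            g, _ = self.mhsa(tokens, tokens, tokens)
            g = g.transpose(1, 2).reshape(b, c, h, w)
            # Local spatial features from the convolutional branch.
            l = self.local(x)
            # Channel attention decides, per channel, how much of each
            # branch contributes to the fused output.
            a = self.gate(g + l)
            return a * g + (1.0 - a) * l

    if __name__ == "__main__":
        m = SCSASketch(dim=64)
        y = m(torch.randn(2, 64, 14, 14))
        print(y.shape)  # torch.Size([2, 64, 14, 14])

The gated blend a * g + (1 - a) * l is one plausible reading of "a fusion module based on channel attention"; the paper itself should be consulted for the exact fusion design.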

Related Material


@InProceedings{Zheng_2023_ICCV,
    author    = {Zheng, Jiahao and Yang, Longqi and Li, Yiying and Yang, Ke and Wang, Zhiyuan and Zhou, Jun},
    title     = {Lightweight Vision Transformer with Spatial and Channel Enhanced Self-Attention},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2023},
    pages     = {1492-1496}
}