Visual Transformers: Where Do Transformers Really Belong in Vision Models?

Bichen Wu, Chenfeng Xu, Xiaoliang Dai, Alvin Wan, Peizhao Zhang, Zhicheng Yan, Masayoshi Tomizuka, Joseph E. Gonzalez, Kurt Keutzer, Peter Vajda; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 599-609

Abstract


A recent trend in computer vision is to replace convolutions with transformers. However, the performance gain of transformers is attained at a steep cost, requiring GPU years and hundreds of millions of samples for training. This excessive resource usage compensates for a misuse of transformers: transformers densely model relationships between their inputs -- ideal for the late stages of a neural network, when concepts are sparse and spatially distant, but extremely inefficient for the early stages, when patterns are redundant and localized. To address these issues, we leverage the respective strengths of both operations, building convolution-transformer hybrids. Critically, in sharp contrast to pixel-space transformers, our Visual Transformer (VT) operates in a semantic token space, judiciously attending to different image parts based on context. Our VTs significantly outperform baselines: on ImageNet, our VT-ResNets outperform convolution-only ResNets by 4.6 to 7 points and transformer-only ViT-B by 2.6 points with 2.5x fewer FLOPs and 2.1x fewer parameters. For semantic segmentation on LIP and COCO-stuff, VT-based feature pyramid networks (FPNs) achieve 0.35 points higher mIoU while reducing the FPN module's FLOPs by 6.5x.
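To make the abstract's core idea concrete, below is a minimal PyTorch sketch of a VT-style block: a feature map is compressed into a small number of semantic tokens via spatial attention, a transformer densely relates the tokens, and the refined tokens are projected back to pixel space. All module names, token counts, and shapes here are illustrative assumptions for exposition, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class VisualTransformerBlock(nn.Module):
    """Illustrative VT-style block (hypothetical, not the paper's code):
    tokenize a feature map into a few semantic tokens, model token-token
    relationships with a transformer, then project tokens back to pixels."""

    def __init__(self, channels: int, num_tokens: int = 16, heads: int = 4):
        super().__init__()
        # Tokenizer: each token is a spatial-attention-weighted average of
        # pixel features (softmax over the HW positions).
        self.token_attn = nn.Conv2d(channels, num_tokens, kernel_size=1)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=channels, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=1)
        # Projector: pixels query the refined tokens via cross-attention.
        self.proj_q = nn.Linear(channels, channels)
        self.proj_k = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        pixels = x.flatten(2).transpose(1, 2)             # (B, HW, C)
        attn = self.token_attn(x).flatten(2).softmax(-1)  # (B, T, HW)
        tokens = attn @ pixels                            # (B, T, C)
        tokens = self.transformer(tokens)                 # dense token-token modeling
        # Fuse the refined tokens back into every pixel (residual update).
        scores = self.proj_q(pixels) @ self.proj_k(tokens).transpose(1, 2)
        scores = scores.softmax(-1)                       # (B, HW, T)
        out = pixels + scores @ tokens                    # (B, HW, C)
        return out.transpose(1, 2).reshape(b, c, h, w)
```

Applied to a late-stage feature map such as x = torch.randn(2, 256, 14, 14), VisualTransformerBlock(256)(x) preserves the input shape, so the block can drop into a ResNet stage in place of a convolutional block. The key efficiency point from the abstract is visible in the shapes: the transformer runs over T = 16 tokens rather than HW = 196 pixels.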

Related Material


[bibtex]
@InProceedings{Wu_2021_ICCV,
  author    = {Wu, Bichen and Xu, Chenfeng and Dai, Xiaoliang and Wan, Alvin and Zhang, Peizhao and Yan, Zhicheng and Tomizuka, Masayoshi and Gonzalez, Joseph E. and Keutzer, Kurt and Vajda, Peter},
  title     = {Visual Transformers: Where Do Transformers Really Belong in Vision Models?},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2021},
  pages     = {599-609}
}