LaPE: Layer-adaptive Position Embedding for Vision Transformers with Independent Layer Normalization

Yu, Runyi; Wang, Zhennan; Wang, Yinhuai; Li, Kehan; Liu, Chang; Duan, Haoyi; Ji, Xiangyang; Chen, Jie

Runyi Yu, Zhennan Wang, Yinhuai Wang, Kehan Li, Chang Liu, Haoyi Duan, Xiangyang Ji, Jie Chen; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 5886-5896

Abstract

Position information is critical for Vision Transformers (VTs) due to the permutation-invariance of self-attention operations. A typical way to introduce position information is adding the absolute Position Embedding (PE) to patch embedding before entering VTs. However, this approach operates the same Layer Normalization (LN) to token embedding and PE, and delivers the same PE to each layer. This results in restricted and monotonic PE across layers, as the shared LN affine parameters are not dedicated to PE, and the PE cannot be adjusted on a per-layer basis. To overcome these limitations, we propose using two independent LNs for token embeddings and PE in each layer, and progressively delivering PE across layers. By implementing this approach, VTs will receive layer-adaptive and hierarchical PE. We name our method as Layer-adaptive Position Embedding, abbreviated as LaPE, which is simple, effective, and robust. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate that LaPE significantly outperforms the default PE method. For example, LaPE improves +1.06% for CCT on CIFAR100, +1.57% for DeiT-Ti on ImageNet-1K, +0.7 box AP and +0.5 mask AP for ViT-Adapter-Ti on COCO, and +1.37 mIoU for tiny Segmenter on ADE20K. This is remarkable considering LaPE only increases negligible parameters, memory, and computational cost.

Related Material

[pdf] [supp]

[bibtex]

@InProceedings{Yu_2023_ICCV, author = {Yu, Runyi and Wang, Zhennan and Wang, Yinhuai and Li, Kehan and Liu, Chang and Duan, Haoyi and Ji, Xiangyang and Chen, Jie}, title = {LaPE: Layer-adaptive Position Embedding for Vision Transformers with Independent Layer Normalization}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2023}, pages = {5886-5896} }