Multi-Level Contrastive Learning for Self-Supervised Vision Transformers

Shentong Mo, Zhun Sun, Chao Li; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 2778-2787


Recent studies aim to establish contrastive self-supervised learning (CSL) algorithms specialized for the family of Vision Transformers (ViTs) to make them function normally as ordinary convolutional-based backbones in the training progress. Despite obtaining promising performance on related downstream tasks, one compelling property of the ViTs is ignored in those approaches. As previous studies have demonstrated, vision transformers benefit from the early stage global attention mechanics, obtaining feature representations that contain information from distant patches, even in their shallow layers. Motivated by this, we present a simple yet effective framework to facilitate the self-supervised feature learning of transformer-based vision architectures, namely, Multi-level Contrastive learning for Vision Transformers (MCVT). Specifically, we equip the vision transformers with individual-based (InfoNCE) and prototypical-based (ProtoNCE) contrastive loss in different stages of the architecture to capture low-level invariance and high-level invariance between views of samples, respectively. We conduct extensive experiments to demonstrate the effectiveness of the proposed method, using two well-known vision transformer backbones, on several vision downstream tasks, including linear classification, detection, and semantic segmentation.

Related Material

@InProceedings{Mo_2023_WACV, author = {Mo, Shentong and Sun, Zhun and Li, Chao}, title = {Multi-Level Contrastive Learning for Self-Supervised Vision Transformers}, booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}, month = {January}, year = {2023}, pages = {2778-2787} }