Training Vision Transformers for Semi-Supervised Semantic Segmentation

Xinting Hu, Li Jiang, Bernt Schiele; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 4007-4017

Abstract


We present S4Former, a novel approach to training Vision Transformers for Semi-Supervised Semantic Segmentation (S4). At its core, S4Former employs a Vision Transformer within a classic teacher-student framework and leverages three novel technical ingredients: PatchShuffle, a parameter-free perturbation technique; Patch-Adaptive Self-Attention (PASA), a fine-grained feature modulation method; and the innovative Negative Class Ranking (NCR) regularization loss. Based on these regularization modules, aligned with Transformer-specific characteristics across the image input, feature, and output dimensions, S4Former exploits the Transformer's ability to capture and differentiate consistent global contextual information in unlabeled images. Overall, S4Former not only defines a new state of the art in S4 but also maintains a streamlined and scalable architecture. Being readily compatible with existing frameworks, S4Former achieves strong improvements (up to 4.9%) on benchmarks like Pascal VOC 2012, COCO, and Cityscapes with varying numbers of labeled data. The code is at https://github.com/JoyHuYY1412/S4Former.
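The abstract names PatchShuffle only as a parameter-free input perturbation; its exact formulation is given in the paper, not here. Purely as an illustration of the general idea, a minimal sketch of patch-level shuffling as an input perturbation might look like the following, assuming square non-overlapping patches and a hypothetical patch_shuffle helper (not taken from the authors' code):

# Hypothetical sketch of a patch-level input perturbation in the spirit of
# PatchShuffle as named in the abstract; the paper's actual method may differ.
import torch

def patch_shuffle(images: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    """Randomly permute non-overlapping square patches of each image.

    images: (B, C, H, W), with H and W divisible by patch_size (assumption).
    Returns a tensor of the same shape with patches shuffled per image.
    """
    b, c, h, w = images.shape
    gh, gw = h // patch_size, w // patch_size
    # Split into a grid of patches: (B, C, H, W) -> (B, gh*gw, C, p, p)
    patches = (
        images.reshape(b, c, gh, patch_size, gw, patch_size)
        .permute(0, 2, 4, 1, 3, 5)
        .reshape(b, gh * gw, c, patch_size, patch_size)
    )
    # Independent random permutation of patch order for each image in the batch
    perm = torch.argsort(torch.rand(b, gh * gw, device=images.device), dim=1)
    idx = perm[:, :, None, None, None].expand_as(patches)
    shuffled = torch.gather(patches, 1, idx)
    # Reassemble the shuffled grid back into an image
    return (
        shuffled.reshape(b, gh, gw, c, patch_size, patch_size)
        .permute(0, 3, 1, 4, 2, 5)
        .reshape(b, c, h, w)
    )

# Example: perturb an unlabeled batch before the student forward pass
x = torch.randn(2, 3, 224, 224)
x_perturbed = patch_shuffle(x, patch_size=16)
assert x_perturbed.shape == x.shape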

Related Material


BibTeX
@InProceedings{Hu_2024_CVPR,
    author    = {Hu, Xinting and Jiang, Li and Schiele, Bernt},
    title     = {Training Vision Transformers for Semi-Supervised Semantic Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {4007-4017}
}