Sparse Multimodal Vision Transformer for Weakly Supervised Semantic Segmentation
Vision Transformers have proven their versatility and utility for complex computer vision tasks, such as land cover segmentation in remote sensing applications. While performing on par or even outperforming other methods like Convolutional Neural Networks (CNNs), Transformers tend to require even larger datasets with fine-grained annotations (e.g., pixel-level labels for land cover segmentation). To overcome this limitation, we propose a weakly-supervised vision Transformer that leverages image-level labels to learn a semantic segmentation task to reduce the human annotation load. We achieve this by slightly modifying the architecture of the vision Transformer through the use of gating units in each attention head to enforce sparsity during training and thereby retaining only the most meaningful heads. This allows us to directly infer pixel-level labels from image-level labels by post-processing the un-pruned attention heads of the model and refining our predictions by iteratively training a segmentation model with high fidelity. Training and evaluation on the DFC2020 dataset shows that our method not only generates high-quality segmentation masks using image-level labels, but also performs on par with fully-supervised training relying on pixel-level labels. Finally, our results show that our method is able to perform weakly-supervised semantic segmentation even on small-scale datasets.