CCASeg: Decoding Multi-Scale Context with Convolutional Cross-Attention for Semantic Segmentation

Jiwon Yoo, Dami Ko, Gyeonghwan Kim; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 9461-9470

Abstract
Capturing multi-scale context within feature maps is crucial for semantic segmentation. With the success of the Vision Transformer (ViT), recent models have been designed with transformer decoders to capture it. However, these models face limitations in utilizing diverse contextual information due to the inherent nature of the attention mechanism and structural constraints. Typically, multi-head attention, which leads to similar receptive fields for each token feature, comes at the expense of significantly increased computational cost, and the nature of the structure can cause inconsistent combination of information across different levels. To address these issues, in this paper we propose a novel and effective decoding scheme, CCASeg, which is based on convolutional cross-attention (CCA). The proposed CCA, along with the decoding structure, is devised not only to capture both local and global context through convolutional kernels of various sizes, but also to achieve high efficiency through effective use of cheap convolution operations. Moreover, the decoding structure, which ensures the successive combination of information across various levels, facilitates understanding of diverse contexts. Consequently, this novel decoding scheme enables feature maps to effectively learn the relationships between objects of different sizes. In this way, our proposed CCASeg outperforms previous state-of-the-art methods on popular semantic segmentation benchmarks, including ADE20K, Cityscapes, COCO-stuff, and iSAID.
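Since the abstract describes CCA only at a high level, the following PyTorch sketch is one plausible reading of the idea, not the authors' implementation: cross-attention between decoder queries and encoder context is approximated with cheap depth-wise convolutions of several kernel sizes, so that each branch contributes a different receptive field. The module name ConvCrossAttention, the kernel sizes (3, 7, 11), and the sigmoid gating are all illustrative assumptions.

```python
# A minimal sketch of a convolutional cross-attention block, assuming a
# design in the spirit of the abstract: multi-scale depth-wise convolutions
# over a query-key interaction replace token-to-token attention. Everything
# below is an assumption, not the paper's actual CCA formulation.
import torch
import torch.nn as nn


class ConvCrossAttention(nn.Module):
    """Cross-attends decoder features (queries) to encoder features
    (context) using depth-wise convolutions of several kernel sizes
    instead of multi-head attention (assumed design)."""

    def __init__(self, dim: int, kernel_sizes=(3, 7, 11)):
        super().__init__()
        self.query_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.key_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)
        # Multi-scale context: one depth-wise convolution per kernel size,
        # so each branch sees a different (local to near-global) receptive field.
        self.context_convs = nn.ModuleList(
            [nn.Conv2d(dim, dim, k, padding=k // 2, groups=dim)
             for k in kernel_sizes]
        )
        self.out_proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, query: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # query:   decoder features, shape (B, C, H, W)
        # context: encoder features resized to (B, C, H, W)
        q = self.query_proj(query)
        k = self.key_proj(context)
        v = self.value_proj(context)
        # Convolutional "attention": aggregate the query-key interaction at
        # multiple scales, then gate the values element-wise.
        attn = sum(conv(q * k) for conv in self.context_convs)
        attn = torch.sigmoid(attn)
        return self.out_proj(attn * v) + query  # residual connection


if __name__ == "__main__":
    cca = ConvCrossAttention(dim=64)
    dec = torch.randn(2, 64, 32, 32)   # decoder (query) features
    enc = torch.randn(2, 64, 32, 32)   # encoder (context) features
    print(cca(dec, enc).shape)         # torch.Size([2, 64, 32, 32])
```

Unlike token-to-token multi-head attention, whose cost grows quadratically with the number of tokens, each depth-wise branch here costs only O(k^2 * C * H * W), which is consistent with the abstract's emphasis on exploiting cheap convolution operations for efficiency.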
Related Material
[pdf]
[bibtex]
@InProceedings{Yoo_2025_WACV,
    author    = {Yoo, Jiwon and Ko, Dami and Kim, Gyeonghwan},
    title     = {CCASeg: Decoding Multi-Scale Context with Convolutional Cross-Attention for Semantic Segmentation},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {9461-9470}
}