U-MixFormer: UNet-Like Transformer with Mix-Attention for Efficient Semantic Segmentation

Yeom, Seul-Ki; von Klitzing, Julian

Seul-Ki Yeom, Julian von Klitzing; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 7710-7719

Abstract

Semantic segmentation has witnessed remarkable advancements with the adaptation of the Transformer architecture. Parallel to the strides made by the Transformer, the CNN-based U-Net has seen significant progress, especially in high-resolution medical imaging and remote sensing. This dual success inspired us to merge both strengths, leading to the inception of a U-Net-based vision transformer decoder tailored for efficient contextual encoding. Here, we propose a novel transformer decoder, U-MixFormer, built upon the U-Net structure and designed for efficient semantic segmentation. Our approach distinguishes itself from previous transformer methods by leveraging lateral connections between the encoder and decoder stages as feature queries for the attention modules, apart from the traditional reliance on skip connections. Moreover, we innovatively mix hierarchical feature maps from various encoder and decoder stages to form a unified representation for keys and values, giving rise to our unique mix-attention module. Our approach demonstrates state-of-the-art performance across various configurations. Extensive experiments show that U-MixFormer outperforms SegFormer, FeedFormer, and SegNeXt by a large margin. For example, U-MixFormer-B0 surpasses SegFormer-B0 and FeedFormer-B0 with 3.8% and 2.0% higher mIoU and 27.3% and 21.8% less computation, and outperforms SegNeXt with 3.3% higher mIoU with the MSCAN-T encoder on ADE20K. Code available at https://github.com/julian-klitzing/u-mixformer.
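The mix-attention idea described above (queries from a lateral encoder feature, keys and values from a mixture of several stages) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that the stage features have already been flattened into tokens of equal channel width, that "mixing" means concatenating along the token axis, and that there are no learned projections or multiple heads. The function names `cross_attention` and `mix_attention` are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math

def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: every query token attends over all key tokens.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def mix_attention(lateral_tokens, stage_tokens):
    # lateral_tokens: tokens of one lateral (encoder) feature map, used as queries.
    # stage_tokens: list of token lists, one per encoder/decoder stage.
    # Assumption: stages are mixed by concatenation along the token axis.
    mixed = [tok for stage in stage_tokens for tok in stage]
    return cross_attention(lateral_tokens, mixed, mixed)

# Toy example: 2 query tokens, 3 mixed key/value tokens drawn from two stages.
lateral = [[1.0, 0.0], [0.0, 1.0]]
stages = [[[1.0, 0.0]], [[0.0, 1.0], [1.0, 1.0]]]
out = mix_attention(lateral, stages)
```

The output has one row per query token, and each row is a convex combination of the mixed value tokens, so the decoder stage receives context from all contributing stages at once rather than from a single skip connection.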

Related Material


[bibtex]

@InProceedings{Yeom_2025_WACV,
    author    = {Yeom, Seul-Ki and von Klitzing, Julian},
    title     = {U-MixFormer: UNet-Like Transformer with Mix-Attention for Efficient Semantic Segmentation},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {7710-7719}
}