FuseForm: Multimodal Transformer for Semantic Segmentation

Justin McMillen, Yasin Yilmaz; Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops, 2025, pp. 618-627

Abstract


For semantic segmentation, integrating multimodal data can vastly improve segmentation performance at the cost of increased model complexity. We introduce FuseForm, a multimodal transformer for semantic segmentation that can effectively and efficiently fuse a large number of homogeneous modalities. We demonstrate its superior performance on five different multimodal datasets ranging from 2 to 12 modalities and comprehensively analyze its components. FuseForm outperforms existing methods through two novel features: a hybrid multimodal fusion block and a transformer-based decoder. It leverages a multimodal cross-attention module for global token fusion alongside convolutional filters' ability to fuse local features. Together, the global and local fusion modules enable enhanced multimodal semantic segmentation. We also introduce a decoder based on a mirrored version of the encoder transformer, which outperforms a popular existing decoder when tuned sufficiently on the dataset.
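
To illustrate the kind of hybrid fusion the abstract describes, the following is a minimal PyTorch sketch in which multimodal cross-attention fuses tokens globally across modalities while a depthwise convolution fuses features locally. The class and parameter names (HybridFusionBlock, num_heads, the residual combination) are assumptions made for illustration only, not the authors' implementation.

import torch
import torch.nn as nn


class HybridFusionBlock(nn.Module):
    """Hypothetical hybrid fusion: global cross-attention + local convolution."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Global fusion: tokens of one modality attend to tokens of all modalities.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Local fusion: depthwise convolution mixes neighboring spatial features.
        self.local_conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, feats):
        # feats: list of per-modality feature maps, each of shape (B, C, H, W)
        B, C, H, W = feats[0].shape
        tokens = [f.flatten(2).transpose(1, 2) for f in feats]  # (B, H*W, C) each
        query = tokens[0]
        context = torch.cat(tokens, dim=1)  # tokens from all modalities as key/value
        fused_global, _ = self.cross_attn(self.norm(query), context, context)
        fused_global = (query + fused_global).transpose(1, 2).reshape(B, C, H, W)
        # Local fusion over the element-wise sum of modality feature maps.
        fused_local = self.local_conv(sum(feats))
        return fused_global + fused_local


if __name__ == "__main__":
    block = HybridFusionBlock(dim=64)
    rgb, depth = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
    print(block([rgb, depth]).shape)  # torch.Size([2, 64, 32, 32])
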

Related Material


@InProceedings{McMillen_2025_WACV,
    author    = {McMillen, Justin and Yilmaz, Yasin},
    title     = {FuseForm: Multimodal Transformer for Semantic Segmentation},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {February},
    year      = {2025},
    pages     = {618-627}
}