Exploring Cross-Attention Maps in Multi-modal Diffusion Transformers for Training-Free Semantic Segmentation

Rento Yamaguchi, Keiji Yanai; Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops, 2024, pp. 260-274

Abstract


This paper presents a novel training-free semantic segmentation method that leverages a pre-trained large-scale image generation model incorporating the Multi-modal Diffusion Transformer (MM-DiT) architecture. Inspired by training-free segmentation techniques that use the U-Net-based denoising model in the Stable Diffusion framework, our approach extracts cross-attention maps between textual and visual features during the inference steps of the MM-DiT to generate mask images. Experimental results demonstrate that our method achieves segmentation accuracy comparable to CLIP-based and U-Net-based Stable Diffusion methods. While the direct segmentation scores are relatively modest, the significance of our work lies in the exploration of cross-attention maps within the DiT. This investigation provides critical insights that could advance training-free segmentation methodologies and enhance the interpretability of diffusion-based models.
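To make the described mechanism concrete, the sketch below shows one way a text-to-image attention map taken from an MM-DiT block could be turned into a segmentation mask: average the attention over heads, pick the column for the target text token, reshape to the latent grid, upsample, and threshold. This is a minimal PyTorch illustration operating on stand-in tensors; the tensor shapes, the helper names, and the normalization/threshold choices are assumptions for illustration, not the authors' implementation or any MM-DiT library API.

```python
# Minimal sketch (not the authors' code): converting a text-to-image
# attention map from one MM-DiT block into a binary segmentation mask.
# All shapes and helper functions are illustrative assumptions.

import torch
import torch.nn.functional as F


def text_to_image_attention(q_img, k_txt, num_heads):
    """Attention weights of image tokens over text tokens.

    In MM-DiT joint attention, image and text tokens are concatenated before
    attention; the "cross-attention map" is the sub-block where image queries
    attend to text keys.

    q_img: (batch, n_img_tokens, dim)  image-token queries
    k_txt: (batch, n_txt_tokens, dim)  text-token keys
    returns: (batch, heads, n_img_tokens, n_txt_tokens)
    """
    b, n_img, dim = q_img.shape
    n_txt = k_txt.shape[1]
    head_dim = dim // num_heads
    q = q_img.view(b, n_img, num_heads, head_dim).transpose(1, 2)
    k = k_txt.view(b, n_txt, num_heads, head_dim).transpose(1, 2)
    attn = (q @ k.transpose(-2, -1)) / head_dim**0.5
    return attn.softmax(dim=-1)


def attention_to_mask(attn, token_idx, latent_hw, image_hw, threshold=0.5):
    """Average attention for one text token over heads, reshape it to the
    latent grid, upsample to image resolution, and threshold into a mask."""
    b = attn.shape[0]
    h, w = latent_hw
    # (batch, n_img_tokens): attention each image token pays to the class token
    token_attn = attn.mean(dim=1)[:, :, token_idx]
    heatmap = token_attn.view(b, 1, h, w)
    heatmap = F.interpolate(heatmap, size=image_hw, mode="bilinear",
                            align_corners=False)
    # Normalize each heatmap to [0, 1], then binarize
    flat = heatmap.view(b, -1)
    lo = flat.min(dim=1)[0].view(b, 1, 1, 1)
    hi = flat.max(dim=1)[0].view(b, 1, 1, 1)
    heatmap = (heatmap - lo) / (hi - lo).clamp_min(1e-8)
    return (heatmap > threshold).float()


if __name__ == "__main__":
    # Stand-in tensors in place of real MM-DiT activations (assumed shapes).
    batch, dim, heads = 1, 1536, 24
    latent_hw, image_hw = (64, 64), (512, 512)
    q_img = torch.randn(batch, latent_hw[0] * latent_hw[1], dim)
    k_txt = torch.randn(batch, 77, dim)

    attn = text_to_image_attention(q_img, k_txt, heads)
    mask = attention_to_mask(attn, token_idx=5, latent_hw=latent_hw,
                             image_hw=image_hw)
    print(mask.shape)  # torch.Size([1, 1, 512, 512])
```

In practice, the query and key tensors would be captured from a chosen MM-DiT block during denoising (for example via forward hooks on the attention layers), and maps from several blocks or timesteps could be averaged before thresholding; the random tensors above merely stand in for those activations.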

Related Material


[pdf]
[bibtex]
@InProceedings{Yamaguchi_2024_ACCV,
    author    = {Yamaguchi, Rento and Yanai, Keiji},
    title     = {Exploring Cross-Attention Maps in Multi-modal Diffusion Transformers for Training-Free Semantic Segmentation},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops},
    month     = {December},
    year      = {2024},
    pages     = {260-274}
}