Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation

Ryota Yoshihashi, Yuya Otsuka, Kenji Doi, Tomohiro Tanaka, Hirokatsu Kataoka; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 2300-2318

Abstract


Advances in generative models for images have inspired various training techniques for image recognition that utilize synthetic images. In semantic segmentation, one promising approach is extracting pseudo-masks from attention maps in text-to-image diffusion models, which enables real-image-and-annotation-free training. However, the pioneering training methods using diffusion-synthetic images and pseudo-masks, e.g., DiffuMask, have limitations in terms of mask quality, scalability, and range of applicable domains. To address these limitations, we propose a new framework that views diffusion-synthetic semantic segmentation training as a weakly supervised learning problem, where we utilize potentially inaccurate attentive information within the generative model as supervision. Motivated by this perspective, we first introduce reliability-aware robust training, originally used as a classifier-based weakly supervised semantic segmentation (WSSS) method, with modifications to handle generative attentions. Additionally, we propose techniques to boost the weakly supervised synthetic training. We introduce prompt augmentation by synonym-and-hyponym replacement, a data augmentation applied to the prompt text set that scales up and diversifies training images with limited text resources. Finally, LoRA-based adaptation of Stable Diffusion enables transfer to a distant domain, e.g., autonomous-driving images. Experiments on PASCAL VOC, ImageNet-S, and Cityscapes show that our method effectively closes the gap between real and synthetic training in semantic segmentation.
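
As a rough illustration of the prompt augmentation idea mentioned in the abstract, the sketch below swaps a class name in a prompt template for its synonyms and hyponyms while keeping the original class as the label. The abstract does not specify the word lists, templates, or procedure, so the lexicon, the template, and the function name `augment_prompts` are all hypothetical assumptions and not the authors' implementation.

```python
# Hypothetical mini-lexicon (assumption): the actual word lists used in the
# paper are not given in the abstract.
LEXICON = {
    "dog": {"synonyms": ["canine"], "hyponyms": ["poodle", "beagle"]},
    "car": {"synonyms": ["automobile"], "hyponyms": ["sedan", "hatchback"]},
}


def augment_prompts(template, class_name, lexicon=LEXICON):
    """Return (prompt, label) pairs for a class name and its replacements.

    The label is always the original class name, so a pseudo-mask extracted
    for a replaced word can still be assigned to the target class.
    """
    entry = lexicon.get(class_name, {})
    # Original word first, then synonyms, then hyponyms.
    words = [class_name] + entry.get("synonyms", []) + entry.get("hyponyms", [])
    return [(template.format(word), class_name) for word in words]


if __name__ == "__main__":
    # Hypothetical prompt template (assumption).
    for prompt, label in augment_prompts("a photo of the {}", "dog"):
        print(label, "<-", prompt)
```

Each generated prompt could then be passed to a text-to-image model, with the attention map for the replaced word serving as the pseudo-mask for the original class. Classes missing from the lexicon fall back to the unmodified prompt.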

Related Material


@InProceedings{Yoshihashi_2024_ACCV,
    author    = {Yoshihashi, Ryota and Otsuka, Yuya and Doi, Kenji and Tanaka, Tomohiro and Kataoka, Hirokatsu},
    title     = {Exploring Limits of Diffusion-Synthetic Training with Weakly Supervised Semantic Segmentation},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {2300-2318}
}