Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance

Dazhong Shen, Guanglu Song, Zeyue Xue, Fu-Yun Wang, Yu Liu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 9370-9379

Abstract


Classifier-Free Guidance (CFG) has been widely used in text-to-image diffusion models, where the CFG scale is introduced to control the strength of text guidance over the whole image space. However, we argue that a global CFG scale results in spatial inconsistency across varying semantic strengths and in suboptimal image quality. To address this problem, we present a novel approach, Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance degrees for different semantic units in text-to-image diffusion models. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. In particular, the cross-attention map in the denoising U-Net backbone is renormalized to assign each patch to its corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic units, we adaptively adjust the CFG scales across different semantic regions to rescale the text guidance degrees to a uniform level. Finally, extensive experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models, without requiring any extra training cost. Our code is available at https://github.com/SmilesDZgk/S-CFG.
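
To make the contrast between a global scale and region-wise scales concrete, the following is a minimal NumPy sketch, not the authors' implementation. `cfg` is the standard classifier-free guidance combination. `region_scaled_cfg` is a hypothetical illustration: it assumes a region map is already available (the abstract derives it from cross- and self-attention maps, which is not reproduced here) and assumes one simple way of bringing guidance to a uniform level, namely equalizing the mean guidance magnitude per region. The function names, the `region_ids` input, and the rescaling rule are all assumptions made for illustration.

```python
import numpy as np


def cfg(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance with a single global scale."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def region_scaled_cfg(eps_uncond, eps_cond, region_ids, base_scale, tiny=1e-8):
    """Hypothetical region-wise guidance (illustrative assumption only).

    eps_uncond, eps_cond: noise predictions of shape (C, H, W).
    region_ids: integer map of shape (H, W) labelling semantic regions.
    Each region receives a scale chosen so that its mean scaled guidance
    magnitude matches base_scale times the image-wide mean magnitude.
    """
    delta = eps_cond - eps_uncond              # text guidance direction
    mag = np.linalg.norm(delta, axis=0)        # per-position magnitude, (H, W)
    global_mean = mag.mean()

    scale_map = np.empty_like(mag)
    for region in np.unique(region_ids):
        mask = region_ids == region
        # Weakly guided regions get a larger scale, strongly guided ones a smaller one.
        scale_map[mask] = base_scale * global_mean / (mag[mask].mean() + tiny)

    # Broadcast the (H, W) scale map over the channel axis.
    return eps_uncond + scale_map[None] * delta, scale_map
```

With a single region covering the whole latent, the sketch reduces to ordinary CFG at `base_scale` (up to the `tiny` stabilizer), which is a convenient sanity check for any region-wise variant.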

Related Material


@InProceedings{Shen_2024_CVPR,
    author    = {Shen, Dazhong and Song, Guanglu and Xue, Zeyue and Wang, Fu-Yun and Liu, Yu},
    title     = {Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {9370-9379}
}