Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting

Liu, Haipeng; Wang, Yang; Qian, Biao; Wang, Meng; Rui, Yong

Haipeng Liu, Yang Wang, Biao Qian, Meng Wang, Yong Rui; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 8038-8047

Abstract

Denoising diffusion probabilistic models (DDPMs) for image inpainting aim to add the noise to the texture of the image during the forward process and recover the masked regions with the unmasked ones of the texture via the reverse denoising process. Despite the meaningful semantics generation the existing arts suffer from the semantic discrepancy between the masked and unmasked regions since the semantically dense unmasked texture fails to be completely degraded while the masked regions turn to the pure noise in diffusion process leading to the large discrepancy between them. In this paper we aim to answer how the unmasked semantics guide the texture denoising process; together with how to tackle the semantic discrepancy to facilitate the consistent and meaningful semantics generation. To this end we propose a novel structure-guided diffusion model for image inpainting named StrDiffusion to reformulate the conventional texture denoising process under the structure guidance to derive a simplified denoising objective for image inpainting while revealing: 1) the semantically sparse structure is beneficial to tackle the semantic discrepancy in the early stage while the dense texture generates the reasonable semantics in the late stage; 2) the semantics from the unmasked regions essentially offer the time-dependent structure guidance for the texture denoising process benefiting from the time-dependent sparsity of the structure semantics. For the denoising process a structure-guided neural network is trained to estimate the simplified denoising objective by exploiting the consistency of the denoised structure between masked and unmasked regions. Besides we devise an adaptive resampling strategy as a formal criterion as whether the structure is competent to guide the texture denoising process while regulate their semantic correlations. Extensive experiments validate the merits of StrDiffusion over the state-of-the-arts. Our code is available at https://github.com/htyjers/StrDiffusion.

Related Material

[pdf] [supp] [arXiv]

[bibtex]

@InProceedings{Liu_2024_CVPR, author = {Liu, Haipeng and Wang, Yang and Qian, Biao and Wang, Meng and Rui, Yong}, title = {Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {8038-8047} }