Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion

Xiao Han, Xiatian Zhu, Jiankang Deng, Yi-Zhe Song, Tao Xiang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 22768-22777

Abstract


Controllable person image synthesis aims at re-rendering a source image with user-specified changes in body pose or appearance. Prior approaches leverage pixel-level denoising diffusion models conditioned on a coarse skeleton via cross-attention, which leads to two limitations: low efficiency and inaccurate conditioning information. To address both issues, a novel Pose-Constrained Latent Diffusion model (PoCoLD) is introduced. Rather than using the skeleton as a sparse pose representation, we exploit DensePose, which offers much richer body-structure information. To capitalize on DensePose effectively and at low cost, we propose an efficient pose-constrained attention module capable of modeling the complex interplay between appearance and pose. Extensive experiments show that PoCoLD outperforms state-of-the-art competitors in image-synthesis fidelity. Critically, it runs 2x faster and uses 3.6x less memory than the latest diffusion-model-based alternative during inference.
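
The abstract describes conditioning a latent diffusion model on DensePose through a pose-constrained attention module. As a rough illustration only, the minimal sketch below shows one way such a module could be structured: cross-attention from noisy-latent tokens to source-appearance tokens, with an additive attention bias computed from target- and source-side DensePose features. The class name, tensor shapes, and bias formulation are assumptions made for this sketch, not details taken from the paper.

    # Hypothetical sketch of a pose-constrained attention block (not the authors' code).
    import torch
    import torch.nn as nn


    class PoseConstrainedAttention(nn.Module):
        """Cross-attention from latent queries to appearance keys/values,
        with an additive bias derived from DensePose features (illustrative only)."""

        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.scale = self.head_dim ** -0.5
            self.to_q = nn.Linear(dim, dim, bias=False)   # queries from the noisy latent
            self.to_k = nn.Linear(dim, dim, bias=False)   # keys from source appearance
            self.to_v = nn.Linear(dim, dim, bias=False)   # values from source appearance
            # projections used to turn pose features into an attention bias
            self.pose_q = nn.Linear(dim, dim, bias=False)
            self.pose_k = nn.Linear(dim, dim, bias=False)
            self.to_out = nn.Linear(dim, dim)

        def forward(self, latent, appearance, pose_target, pose_source):
            # latent:      (B, N_q, C) tokens of the noisy latent (target-pose side)
            # appearance:  (B, N_k, C) tokens of source-image appearance features
            # pose_target: (B, N_q, C) DensePose features aligned with the target pose
            # pose_source: (B, N_k, C) DensePose features aligned with the source image
            B, Nq, C = latent.shape
            Nk = appearance.shape[1]
            h, d = self.num_heads, self.head_dim

            q = self.to_q(latent).view(B, Nq, h, d).transpose(1, 2)        # (B, h, Nq, d)
            k = self.to_k(appearance).view(B, Nk, h, d).transpose(1, 2)    # (B, h, Nk, d)
            v = self.to_v(appearance).view(B, Nk, h, d).transpose(1, 2)    # (B, h, Nk, d)

            # Pose bias: large where target and source body regions correspond.
            pq = self.pose_q(pose_target).view(B, Nq, h, d).transpose(1, 2)
            pk = self.pose_k(pose_source).view(B, Nk, h, d).transpose(1, 2)
            pose_bias = torch.matmul(pq, pk.transpose(-1, -2)) * self.scale  # (B, h, Nq, Nk)

            attn = torch.matmul(q, k.transpose(-1, -2)) * self.scale + pose_bias
            attn = attn.softmax(dim=-1)
            out = torch.matmul(attn, v).transpose(1, 2).reshape(B, Nq, C)
            return self.to_out(out)


    if __name__ == "__main__":
        block = PoseConstrainedAttention(dim=64, num_heads=4)
        x = torch.randn(2, 256, 64)  # e.g. 16x16 latent tokens with 64 channels
        print(block(x, torch.randn(2, 256, 64),
                    torch.randn(2, 256, 64), torch.randn(2, 256, 64)).shape)
        # torch.Size([2, 256, 64])

In this sketch the pose information only re-weights how latent positions attend to appearance positions, rather than being concatenated as an extra input, which is one plausible way to keep the conditioning cost low.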

Related Material


@InProceedings{Han_2023_ICCV,
    author    = {Han, Xiao and Zhu, Xiatian and Deng, Jiankang and Song, Yi-Zhe and Xiang, Tao},
    title     = {Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {22768-22777}
}