Diffusion-based Realistic Listening Head Generation via Hybrid Motion Modeling

Yinuo Wang, Yanbo Fan, Xuan Wang, Guo Yu, Fei Wang; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 15885-15895

Abstract


Listening head generation aims to synthesize non-verbal, responsive listening head videos that react naturally to a given speaker, for which realistic head movements, expressive facial expressions, and high visual quality are all expected. Previous approaches typically follow a two-stage pipeline that first generates intermediate 3D motion signals, such as 3DMM coefficients, and then synthesizes the videos by deterministic rendering, suffering from limited motion expressiveness and low visual quality (e.g., 256×256). In this work, we propose a novel listening head generation method that harnesses the generative capabilities of the diffusion model for both motion generation and high-quality rendering. Crucially, we propose an effective hybrid motion modeling module that addresses training difficulties caused by the scarcity of listening head data while preserving the intricate details that may be lost in explicit motion representations. We further develop tailored control guidance for head pose and facial expression by integrating their intrinsic motion characteristics. Our method enables high-fidelity video generation at 512×512 resolution and delivers vivid listener motion feedback. We conduct comprehensive experiments and obtain superior performance in terms of both visual quality and motion expressiveness compared with existing methods.
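
The paper itself does not provide code on this page; the snippet below is only a minimal illustrative sketch of the kind of conditional diffusion sampling the abstract describes, i.e., denoising a sequence of listener motion latents conditioned on speaker features. All names (DummyDenoiser, sample_listener_motion, speaker_feat, the latent dimensions) are hypothetical and do not come from the authors' hybrid motion modeling module or rendering pipeline.

```python
# Minimal illustrative sketch (not the authors' code): a generic DDPM-style
# sampler that denoises listener motion latents conditioned on speaker features.
# All module/variable names here are hypothetical.
import torch
import torch.nn as nn


class DummyDenoiser(nn.Module):
    """Stand-in for a conditional noise-prediction network."""

    def __init__(self, motion_dim=64, cond_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + cond_dim + 1, 256),
            nn.SiLU(),
            nn.Linear(256, motion_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, T, motion_dim) noisy motion latents
        # t:   (B,) diffusion timestep; cond: (B, T, cond_dim) speaker features
        t_emb = t.float()[:, None, None].expand(-1, x_t.size(1), 1) / 1000.0
        return self.net(torch.cat([x_t, cond, t_emb], dim=-1))


@torch.no_grad()
def sample_listener_motion(denoiser, speaker_feat, num_steps=50, motion_dim=64):
    """Ancestral DDPM sampling of listener motion latents given speaker features."""
    B, T, _ = speaker_feat.shape
    betas = torch.linspace(1e-4, 0.02, num_steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(B, T, motion_dim)            # start from pure Gaussian noise
    for i in reversed(range(num_steps)):
        t = torch.full((B,), i, dtype=torch.long)
        eps = denoiser(x, t, speaker_feat)       # predict the injected noise
        # DDPM posterior mean update
        x = (x - betas[i] / torch.sqrt(1.0 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
    return x                                     # (B, T, motion_dim) motion latents


if __name__ == "__main__":
    denoiser = DummyDenoiser()
    speaker_feat = torch.randn(2, 25, 128)       # e.g., 25 frames of speaker features
    motion = sample_listener_motion(denoiser, speaker_feat)
    print(motion.shape)                          # torch.Size([2, 25, 64])
```

In the paper's setting, the sampled motion representation would additionally be combined with the hybrid (explicit plus implicit) motion cues and the pose/expression control guidance before high-resolution rendering; those components are not reflected in this generic sketch.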

Related Material


[bibtex]
@InProceedings{Wang_2025_CVPR,
    author    = {Wang, Yinuo and Fan, Yanbo and Wang, Xuan and Yu, Guo and Wang, Fei},
    title     = {Diffusion-based Realistic Listening Head Generation via Hybrid Motion Modeling},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {15885-15895}
}