@InProceedings{Lee_2026_CVPR,
    author    = {Lee, Seungeun and Moon, SeungJun and Lew, Hah Min and Kang, Ji-Su and Park, Gyeong-Moon},
    title     = {AudioAvatar: Personalized Audio-driven Whole-body Talking Avatars},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {3998-4010}
}
AudioAvatar: Personalized Audio-driven Whole-body Talking Avatars
Abstract
Prior expressive whole-body conversational avatar systems map audio to parametric poses and then render, creating a lossy bottleneck where quantization, retargeting, and tracking errors accumulate. This degrades audio-motion synchronization and suppresses micro-articulations critical for realism (such as bilabial closures, cheek inflation, nasolabial motion, blinks, and fine hand gestures), especially under single-image personalization. We propose an end-to-end framework that builds a full-body, photorealistic conversational avatar from a single image and drives it directly from audio, bypassing intermediate pose prediction. The avatar is modeled as a particle-based deformation field of 3D Gaussian primitives in a canonical space, with an audio-conditioned dynamics module that outputs per-particle trajectories for face, hands, and body, enabling localized high-frequency control with globally coherent motion. A splat-based differentiable renderer preserves identity, texture, and photorealism, while feature-level distillation from a large audio-driven video diffusion model and weak supervision from synthetic audio-conditioned clips further improve synchronization and natural expressivity. Joint photometric and temporal objectives shape the audio-conditioned deformation and rendering. Experiments across diverse speakers show improved lip-audio sync, fine facial detail, and conversational gesture naturalness over pose-driven baselines.
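To make the idea of an audio-conditioned, per-particle deformation field concrete, the following is a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the architecture, so the two-layer MLP, the feature sizes, the random untrained weights, and the names `init_mlp` and `deform` are all assumptions chosen for illustration. It only shows how canonical Gaussian centers might be displaced directly by an audio feature, without an intermediate pose representation.

```python
import numpy as np


def init_mlp(in_dim, hidden_dim, out_dim, rng):
    """Hypothetical two-layer MLP; weights are random placeholders, not trained."""
    return {
        "w1": rng.normal(0.0, 0.1, size=(in_dim, hidden_dim)),
        "b1": np.zeros(hidden_dim),
        "w2": rng.normal(0.0, 0.1, size=(hidden_dim, out_dim)),
        "b2": np.zeros(out_dim),
    }


def deform(canonical_xyz, audio_feat, params):
    """Displace canonical particle centers given one frame's audio feature.

    canonical_xyz: (N, 3) Gaussian centers in canonical space.
    audio_feat:    (D,) audio feature for a single frame (assumed precomputed).
    Returns (N, 3) deformed centers.
    """
    n = canonical_xyz.shape[0]
    # Every particle sees the same audio feature alongside its own position,
    # so the predicted offset can vary locally (lips vs. hands vs. torso).
    audio_tiled = np.broadcast_to(audio_feat, (n, audio_feat.shape[0]))
    x = np.concatenate([canonical_xyz, audio_tiled], axis=1)
    h = np.tanh(x @ params["w1"] + params["b1"])
    offsets = h @ params["w2"] + params["b2"]
    return canonical_xyz + offsets


rng = np.random.default_rng(0)
canonical_xyz = rng.uniform(-1.0, 1.0, size=(1000, 3))  # toy particle cloud
audio_frames = rng.normal(size=(5, 16))                 # 5 frames, 16-dim features
params = init_mlp(3 + 16, 32, 3, rng)

# One deformed particle set per audio frame gives per-particle trajectories.
trajectories = np.stack([deform(canonical_xyz, a, params) for a in audio_frames])
print(trajectories.shape)  # (5, 1000, 3)
```

In a full system, the deformed centers (together with per-particle covariance, opacity, and color) would be passed to a differentiable splatting renderer, and the weights would be trained with the photometric and temporal objectives the abstract mentions; none of that is modeled here.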

