EmoVOCA: Speech-Driven Emotional 3D Talking Heads

Federico Nocentini, Claudio Ferrari, Stefano Berretti; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 2859-2868

Abstract


A notable challenge in 3D talking head generation lies in blending speech-related motions with expression dynamics. This is primarily caused by the lack of comprehensive 3D datasets that combine diversity in spoken sentences with a variety of facial expressions. Some works in the literature attempted to overcome this lack of data by fitting parametric 3D models (3DMMs) to 2D videos and using the reconstructed 3D faces as a replacement. However, their underlying parametric space limits the precision required to accurately reproduce convincing lip motions and synchronization, which is crucial for the application at hand. In this work, we look at the problem from a different perspective and develop a data-driven technique for combining inexpressive 3D talking heads with a set of 3D expressive sequences, which we use to create a synthetic dataset called EmoVOCA. We then design and train an emotional 3D talking head generator that accepts a 3D face, an audio file, an emotion label, and an intensity value as inputs, and learns to animate the audio-synchronized lip movements with the expressive traits of the face. Comprehensive experiments, both quantitative and qualitative, using our data and generator show a superior ability to synthesize convincing animations compared with the best-performing methods in the literature. Our code and pre-trained models are available at https://github.com/miccunifi/EmoVOCA.
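To make the generator's input/output contract concrete, the sketch below is a minimal, hypothetical PyTorch module mirroring only the interface named in the abstract (neutral 3D face, audio features, emotion label, intensity value → animated vertex sequence). Every name, dimension, and layer choice here is an assumption for illustration, not the authors' implementation; the actual architecture is in the paper and the linked repository.

```python
import torch
import torch.nn as nn

class EmotionalTalkingHeadGenerator(nn.Module):
    """Hypothetical interface sketch: a neutral 3D face template, per-frame
    audio features, an emotion label, and an intensity value go in; a
    sequence of animated 3D faces comes out."""

    def __init__(self, num_vertices: int, num_emotions: int,
                 audio_dim: int = 768, hidden_dim: int = 512):
        super().__init__()
        # Project per-frame audio features (e.g., from a pretrained
        # speech encoder) into the model's hidden space.
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        # One learned embedding per emotion label; intensity scales it.
        self.emotion_emb = nn.Embedding(num_emotions, hidden_dim)
        # Decode the fused features over time, then map to per-vertex
        # displacements relative to the neutral template.
        self.decoder = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.to_offsets = nn.Linear(hidden_dim, num_vertices * 3)

    def forward(self, template: torch.Tensor, audio_feats: torch.Tensor,
                emotion: torch.Tensor, intensity: torch.Tensor) -> torch.Tensor:
        # template:    (B, V, 3) neutral face vertices
        # audio_feats: (B, T, audio_dim) per-frame speech features
        # emotion:     (B,) integer emotion labels
        # intensity:   (B,) scalar expression strength in [0, 1]
        h = self.audio_proj(audio_feats)
        style = self.emotion_emb(emotion) * intensity.unsqueeze(-1)
        h = h + style.unsqueeze(1)          # broadcast style over time
        h, _ = self.decoder(h)
        offsets = self.to_offsets(h)        # (B, T, V*3)
        B, T, _ = offsets.shape
        # Animated sequence = neutral template + per-frame displacements.
        return template.unsqueeze(1) + offsets.view(B, T, -1, 3)
```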

Related Material


@InProceedings{Nocentini_2025_WACV,
    author    = {Nocentini, Federico and Ferrari, Claudio and Berretti, Stefano},
    title     = {EmoVOCA: Speech-Driven Emotional 3D Talking Heads},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {2859-2868}
}