@InProceedings{Drobyshev_2024_CVPR,
    author    = {Drobyshev, Nikita and Casademunt, Antoni Bigata and Vougioukas, Konstantinos and Landgraf, Zoe and Petridis, Stavros and Pantic, Maja},
    title     = {EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {8498-8507}
}
EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars
Abstract
Head avatars animated by visual signals have gained popularity, particularly in cross-driving synthesis, where the driver differs from the animated character, a challenging but highly practical approach. The recently presented MegaPortraits model has demonstrated state-of-the-art results in this domain. We conduct a deep examination and evaluation of this model, with a particular focus on its latent space for facial expression descriptors, and uncover several limitations in its ability to express intense face motions. To address these limitations, we propose substantial changes to both the training pipeline and the model architecture and introduce our EMOPortraits model, in which we: (1) enhance the model's capability to faithfully support intense, asymmetric face expressions, setting a new state-of-the-art result in the emotion transfer task and surpassing previous methods in both metrics and quality; and (2) incorporate a speech-driven mode into our model, achieving top-tier performance in audio-driven facial animation and making it possible to drive the source identity through diverse modalities, including visual signal, audio, or a blend of both. Furthermore, we propose a novel multi-view video dataset featuring a wide range of intense and asymmetric facial expressions, filling the gap left by the absence of such data in existing datasets.
Related Material