EmoTalker: Audio Driven Emotion Aware Talking Head Generation

Xiaoqian Shen, Faizan Farooq Khan, Mohamed Elhoseiny; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 1900-1917

Abstract


Talking head synthesis aims to create videos of a person speaking with accurately synchronized lip movements and natural facial expressions that correspond to the driving audio. Previous approaches, however, rely on reference frames or extra labels to control emotions and facial expressions; this disentangles utterance from expression and ignores the impact of audio fluctuations on face motions, e.g., head pose, facial expressions, and emotions. In this work, we present EmoTalker, which generates arbitrary identities with diverse and natural facial expressions from audio alone, without relying on driving frames or emotion labels as input. To achieve this, we represent frames as sequences of 3D motion coefficients of a 3DMM and separate them into lip-related coefficients and the remaining coefficients (head pose, expressions), which we treat as facial motions. To model lip movement, we start from a pre-trained audio encoder and map its features to the corresponding lip representation. For facial motions, we employ a two-stage training strategy: 1) we first project facial motions into the finite space of a codebook embedded with emotion-aware facial expression priors; 2) we then devise a cross-modal Transformer to explicitly model the correlations between audio and the different types of facial motions. Experimental results and user studies show that our model achieves state-of-the-art performance on an emotional audio-visual dataset and produces more realistic talking head videos with synchronized lip movements and vivid facial expressions. Our code is available at https://github.com/xiaoqian-shen/EmoTalker.
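The first stage described above, projecting continuous facial motions into the finite space of a learned codebook, amounts to a nearest-neighbor vector quantization at inference time. The sketch below illustrates that lookup step only; the feature dimensions, codebook size, and function name are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def quantize_motion(motion, codebook):
    """Map each frame's continuous facial-motion features to its nearest
    codebook entry (illustrative sketch of the stage-1 lookup).

    motion:   (T, D) sequence of facial-motion features, one row per frame
    codebook: (K, D) learned embedding vectors (emotion-aware priors)
    returns:  (T,) code indices and the (T, D) quantized sequence
    """
    # Pairwise squared distances between every frame and every code vector
    d2 = ((motion[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d2.argmin(axis=1)  # nearest code per frame
    return idx, codebook[idx]

# Toy example: 4 frames of 3-D features against a 2-entry codebook
codebook = np.array([[0.0, 0.0, 0.0],
                     [1.0, 1.0, 1.0]])
motion = np.array([[0.1, 0.0, 0.1],
                   [0.9, 1.1, 1.0],
                   [0.2, 0.1, 0.0],
                   [1.0, 0.9, 1.1]])
idx, quantized = quantize_motion(motion, codebook)
```

In the full model, the codebook itself is learned during stage 1, and the stage-2 cross-modal Transformer predicts which codes to use from audio; this sketch only shows how a continuous motion sequence maps onto the finite code space.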

Related Material


@InProceedings{Shen_2024_ACCV,
  author    = {Shen, Xiaoqian and Khan, Faizan Farooq and Elhoseiny, Mohamed},
  title     = {EmoTalker: Audio Driven Emotion Aware Talking Head Generation},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
  month     = {December},
  year      = {2024},
  pages     = {1900-1917}
}