Speech-Driven 3D Facial Animation With Implicit Emotional Awareness: A Deep Learning Approach

Hai X. Pham, Samuel Cheung, Vladimir Pavlovic; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017, pp. 80-88

Abstract


We introduce a long short-term memory recurrent neural network (LSTM-RNN) approach for real-time facial animation, which automatically estimates the head rotation and facial action unit activations of a speaker from her speech alone. Specifically, the time-varying contextual non-linear mapping between the audio stream and visual facial movements is realized by training an LSTM neural network on a large audio-visual data corpus. In this work, we extract a set of acoustic features from the input audio, including the Mel-scaled spectrogram, Mel-frequency cepstral coefficients, and the chromagram, which can effectively represent both the contextual progression and the emotional intensity of the speech. Output facial movements are characterized by the 3D rotation and expression blending weights of a blendshape model, which can be used directly for animation. Thus, even though our model does not explicitly predict the affective state of the target speaker, her emotional manifestation is recreated via the expression weights of the face model. Experiments on an evaluation dataset of different speakers across a wide range of affective states demonstrate promising results of our approach for real-time speech-driven facial animation.
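Since the paper does not prescribe an implementation, the following is a minimal sketch of the pipeline described above (per-frame acoustic features fed to an LSTM that regresses head rotation and blendshape weights), assuming librosa for feature extraction and PyTorch for the network. The feature dimensions, layer sizes, blendshape count, and the sigmoid on the expression weights are illustrative assumptions, not the authors' configuration.

```python
# Minimal sketch of the audio-to-animation pipeline described in the abstract.
# Assumes librosa for acoustic features and PyTorch for the LSTM regressor;
# all dimensions and hyperparameters are illustrative, not the paper's setup.
import librosa
import numpy as np
import torch
import torch.nn as nn


def extract_acoustic_features(wav_path, sr=16000, n_fft=512, hop_length=256):
    """Stack Mel-scaled spectrogram, MFCCs, and chromagram per audio frame."""
    y, sr = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=40)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=n_fft, hop_length=hop_length)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length)
    # Shape: (num_frames, 40 + 13 + 12) after moving time to the first axis.
    return np.vstack([librosa.power_to_db(mel), mfcc, chroma]).T.astype(np.float32)


class SpeechToFaceLSTM(nn.Module):
    """LSTM regressor from per-frame acoustic features to 3D head rotation
    (3 values) plus blendshape expression weights (n_blendshapes values)."""

    def __init__(self, feat_dim=65, hidden_dim=256, n_layers=2, n_blendshapes=46):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers=n_layers,
                            batch_first=True)
        self.head = nn.Linear(hidden_dim, 3 + n_blendshapes)

    def forward(self, x):                      # x: (batch, time, feat_dim)
        h, _ = self.lstm(x)
        out = self.head(h)                     # (batch, time, 3 + n_blendshapes)
        # Sigmoid keeps blending weights in [0, 1]; this is an assumption here.
        rotation, weights = out[..., :3], torch.sigmoid(out[..., 3:])
        return rotation, weights


# Usage: process a whole utterance, then drive the blendshape rig frame by frame.
feats = torch.from_numpy(extract_acoustic_features("speech.wav")).unsqueeze(0)
model = SpeechToFaceLSTM(feat_dim=feats.shape[-1])
rotation, blend_weights = model(feats)
```

Because the LSTM is applied causally over the incoming audio frames, the same model can be run incrementally at inference time, which is what makes the real-time animation setting plausible.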

Related Material


[bibtex]
@InProceedings{Pham_2017_CVPR_Workshops,
author = {Pham, Hai X. and Cheung, Samuel and Pavlovic, Vladimir},
title = {Speech-Driven 3D Facial Animation With Implicit Emotional Awareness: A Deep Learning Approach},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {July},
year = {2017}
}