SPACE : Speech-driven Portrait Animation with Controllable Expression
Supplementary video results with audio


Teaser video
Multiple outputs for the same inputs


Neutral emotion, predicted pose | Neutral emotion, transferred pose | Happy emotion, predicted pose | Surprise emotion, predicted pose
Neutral emotion, predicted pose | Neutral emotion, transferred pose | Sad emotion, predicted pose | Fear emotion, predicted pose


Intermediate output visualization
Sample inputs along with intermediate and final outputs


Input image | Predicted normalized facial landmarks | Predicted posed facial landmarks | Predicted latent face-vid2vid keypoints | Final output video


Emotion control results
Changing the emotion label and intensity


Source image | Happy 0.5 | Happy
Source image | Angry 0.5 | Angry



Blinking and eye gaze controllability

Input image | Blinking | Gaze change

Comparison with prior work
Outputs given a fixed head pose


Input image | PC-AVS | MakeItTalk | Wav2Lip | SPACE (ours)



Ablations

The results below compare the effect of removing certain inputs from the second stage of our framework, Landmarks2Latents (L2L). In our final design, we provide both the landmarks predicted by the first stage (Speech2Landmarks) and the audio. The first column contains results when the audio is not provided to the L2L model. The second contains results when the landmarks are not provided, which effectively turns it into a Speech2Latents model. The third column contains the results from our full model.
Not providing the audio as input reduces the quality of the lip sync and facial motions, since the facial landmarks alone cannot capture all of the information in the audio. Not providing the facial landmarks causes the output to drift from its original pose and configuration.

No audio input (L2L-landmarks) | No landmark input (S2Latents) | Audio + landmark input (SPACE)
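
As a rough illustration (not the actual SPACE implementation), the three configurations compared here can be viewed as the same network receiving different subsets of its inputs. All module names, feature dimensions, and layer sizes in the sketch below are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class Landmarks2Latents(nn.Module):
    """Minimal sketch of the second stage: maps per-frame facial landmarks
    and an audio feature to latent face-vid2vid keypoints. Layer sizes and
    dimensions are illustrative assumptions, not the SPACE implementation."""

    def __init__(self, n_landmarks=68, audio_dim=80, latent_dim=60,
                 use_audio=True, use_landmarks=True):
        super().__init__()
        assert use_audio or use_landmarks, "at least one input is required"
        self.use_audio = use_audio
        self.use_landmarks = use_landmarks
        in_dim = (n_landmarks * 2 if use_landmarks else 0) \
               + (audio_dim if use_audio else 0)
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, landmarks, audio_feat):
        # landmarks: (B, n_landmarks, 2); audio_feat: (B, audio_dim)
        inputs = []
        if self.use_landmarks:
            inputs.append(landmarks.flatten(1))
        if self.use_audio:
            inputs.append(audio_feat)
        return self.net(torch.cat(inputs, dim=1))

# The three ablation settings shown above:
l2l_landmarks = Landmarks2Latents(use_audio=False)       # no audio input
s2latents     = Landmarks2Latents(use_landmarks=False)   # no landmark input
space_full    = Landmarks2Latents()                      # full model (SPACE)
```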



The videos below show why facial landmark normalization is necessary to achieve good output quality from the Speech2Landmarks network. If landmarks are predicted in a non-normalized space, the pose of the output facial landmarks does not remain stable and keeps drifting. More importantly, the lip motions are less pronounced, leading to poor lip sync.



Input image | Without normalization (S2L-raw) | With normalization (SPACE)
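
For intuition, this kind of normalization can be sketched as a similarity (Procrustes-style) alignment of each frame's landmarks to a canonical frontal template, which factors out head pose before Speech2Landmarks predicts the facial motion. The exact procedure used in SPACE may differ; the function below is only an illustrative sketch.

```python
import numpy as np

def normalize_landmarks(landmarks, template):
    """Align 2D facial landmarks to a canonical frontal template with a
    similarity transform (rotation, uniform scale, translation).
    Illustrative Procrustes/Umeyama-style alignment; not necessarily the
    exact normalization used in SPACE.

    landmarks, template: (N, 2) arrays of corresponding points.
    Returns the landmarks expressed in the template's normalized space.
    """
    src = landmarks - landmarks.mean(axis=0)
    dst = template - template.mean(axis=0)

    # Optimal rotation via SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(src.T @ dst)
    d = np.sign(np.linalg.det(u @ vt))           # guard against reflections
    flip = np.array([1.0, d])
    rot = (u * flip) @ vt                        # u @ diag(flip) @ vt
    scale = (s * flip).sum() / (src ** 2).sum()  # optimal uniform scale

    return scale * src @ rot + template.mean(axis=0)
```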



Novel application

SPACE can be used in video conferencing, since sending only audio and a still image saves substantial bandwidth compared to sending the full video. Due to its pose controllability, it can be combined with existing approaches so that the user can freely switch between video and audio inputs. In low-bandwidth scenarios, the video conferencing system can fall back to the audio-driven mode and still generate realistic output videos.
Below, we show outputs from a system in which we use face-vid2vid and SPACE to produce the output video. face-vid2vid fails when the face in the input video is occluded by hands or other objects. In such cases, we fall back to SPACE. This is demonstrated in the last column, hybrid animation.

Input video | Audio-based animation | face-vid2vid-based animation | Hybrid animation
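
The per-frame fallback logic in this hybrid system can be sketched as below. The callables drive_with_video, drive_with_audio, and face_is_visible are hypothetical placeholders standing in for face-vid2vid, audio-driven SPACE, and a face/occlusion detector; they are not part of any released API.

```python
def hybrid_animation(frames, audio_chunks, source_image,
                     drive_with_video, drive_with_audio, face_is_visible):
    """Illustrative per-frame fallback between video-driven and audio-driven
    animation. All callables are placeholders supplied by the caller."""
    output = []
    for frame, audio in zip(frames, audio_chunks):
        if frame is not None and face_is_visible(frame):
            # Normal operation: re-enact the source image from the live video.
            output.append(drive_with_video(source_image, frame))
        else:
            # Face occluded or video unavailable: fall back to audio-driven SPACE.
            output.append(drive_with_audio(source_image, audio))
    return output
```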