Videos: emotion manipulation and pose control. Top row: neutral emotion with predicted pose, neutral emotion with transferred pose, happy emotion with predicted pose, surprise emotion with predicted pose. Bottom row: neutral emotion with predicted pose, neutral emotion with transferred pose, sad emotion with predicted pose, fear emotion with predicted pose.
Videos: intermediate outputs of our pipeline. Left to right: input image, predicted normalized facial landmarks, predicted posed facial landmarks, predicted latent face-vid2vid keypoints, final output video.
Videos: emotion intensity control. Top row: source image, Happy at 0.5 intensity, Happy at full intensity. Bottom row: source image, Angry at 0.5 intensity, Angry at full intensity.
Videos: controllable blinking and gaze. Left to right: input image, blinking, gaze change.
Videos: comparison with prior methods. Left to right in each row: input image, PC-AVS, MakeItTalk, Wav2Lip, SPACE (ours).
The results below show the effect of removing certain inputs from the second stage of our framework, Landmarks2Latents (L2L).
In our final design, we provide both the landmarks predicted by the first stage (Speech2Landmarks) and the audio.
The first column contains results when the audio is not provided to the L2L model.
The second column contains results when the landmarks are not provided, which effectively turns it into a Speech2Latents model. The third column contains results from our full model.
Not providing the audio reduces the quality of the lip sync and facial motions, as there is no way to add information not captured by the facial landmarks alone.
Not providing the facial landmarks causes the output to drift from its original pose and configuration.
Videos, left to right: no audio input (L2L-landmarks), no landmark input (S2Latents), audio + landmark input (SPACE).
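For concreteness, the sketch below illustrates the kind of second-stage interface being ablated: a module that consumes both per-frame landmarks and audio features and outputs latent keypoints. The MLP fusion, layer sizes, and dimensions are placeholder assumptions for illustration, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class Landmarks2Latents(nn.Module):
    """Illustrative second-stage module: per-frame landmarks + audio -> latent keypoints.

    The concatenation-based fusion and all dimensions below are assumptions.
    """
    def __init__(self, n_landmarks=68, audio_dim=80, latent_dim=45):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_landmarks * 2 + audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, latent_dim),  # e.g. flattened face-vid2vid latent keypoints
        )

    def forward(self, landmarks, audio_feats):
        # landmarks: (T, n_landmarks, 2) predicted by Speech2Landmarks
        # audio_feats: (T, audio_dim), e.g. mel-spectrogram frames
        x = torch.cat([landmarks.flatten(1), audio_feats], dim=-1)
        return self.net(x)

# The ablations correspond to dropping one of the two inputs:
# removing audio_feats gives "L2L-landmarks", removing landmarks gives "S2Latents".
```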
The videos below show why facial landmark normalization is necessary to achieve good output quality
from the Speech2Landmarks network.
If we predict landmarks in a non-normalized space, the pose of the output facial landmarks does not
remain stable and keeps drifting.
More importantly, the lip motions are less pronounced, leading to poor lip sync.
Videos, left to right: input image, without normalization (S2L-raw), with normalization (SPACE).
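As a minimal sketch of one way such a normalization could be implemented, the snippet below aligns each frame's landmarks to a canonical frontal template with a similarity (Procrustes) fit. The template and the exact transform are assumptions for illustration, not necessarily the normalization used in SPACE.

```python
import numpy as np

def normalize_landmarks(landmarks, template):
    """Align landmarks (N, 2) to a canonical template (N, 2), removing translation,
    scale, and in-plane rotation so motion can be predicted in a pose-free space."""
    l_c = landmarks - landmarks.mean(axis=0)
    t_c = template - template.mean(axis=0)
    scale = np.sqrt((t_c ** 2).sum() / (l_c ** 2).sum())  # match overall spread
    u, _, vt = np.linalg.svd(t_c.T @ l_c)                  # Kabsch rotation fit
    rot = u @ vt                                           # (reflection check omitted)
    return scale * (l_c @ rot.T)

# The predicted pose (or a pose transferred from another video) is applied afterwards
# by inverting this transform, yielding the posed landmarks shown in the pipeline above.
```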
SPACE can be used in video conferencing, since sending only audio and a still image saves substantial
bandwidth over sending the full video. Due to its pose controllability, it can be combined with
existing approaches so that the user can freely switch between video and audio inputs. In
low-bandwidth scenarios, the video conferencing system can fall back to the audio-driven mode and
still generate realistic output videos.

Below, we show outputs from a system that uses face-vid2vid and SPACE together to produce the output
video. face-vid2vid fails when the face in the input video is occluded by hands or other objects; in
such cases, we fall back to SPACE. This is demonstrated in the last column, hybrid animation.
Videos, left to right: input video, audio-based animation, face-vid2vid-based animation, hybrid animation.
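A hypothetical sketch of the switching logic described above is shown below. The occlusion check and all component callables are placeholders passed in as arguments, not part of any released API.

```python
def animate_frame(src_image, video_frame, audio_chunk,
                  video_to_latents, audio_to_latents, generator, is_occluded):
    """Produce one output frame, preferring the video-driven path when possible.

    video_to_latents, audio_to_latents, generator, and is_occluded stand in for the
    face-vid2vid keypoint extractor, the SPACE audio-driven pipeline, the shared
    face-vid2vid generator, and an occlusion detector, respectively.
    """
    if video_frame is not None and not is_occluded(video_frame):
        latent_kps = video_to_latents(video_frame)             # video-driven (face-vid2vid)
    else:
        latent_kps = audio_to_latents(src_image, audio_chunk)  # audio-driven fallback (SPACE)
    return generator(src_image, latent_kps)
```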