Audio-driven Neural Gesture Reenactment with Video Motion Graphs: Supplementary Material
In this webpage, we show our synthesized video results as well as comparisons to other SOTA methods and baselines to better demonstrate the effectiveness of our approach.
1. Reenacted Video Results
Example 1: our system reenacts the reference video with the target speech audio.
[Videos] Reference Video (Speaker A, a short sample clip) + Target Audio (Speaker B) = Reenacted Video (A's appearance + B's voice)
Example 2: reenacted result on a TED-talks speaker.
[Videos] Reference Video (Speaker II, a short sample clip) + Target Audio = Reenacted Video (reenacted Speaker II)
2. Our Blended Playback vs. Direct Playback
We show the GIF result for Fig. 3 in the main paper (played in 3x slow motion).
[GIFs] Input Sequence 1 + Input Sequence 2 → Direct Playback (notice the temporal jitter in the middle) vs. Blended Playback (our seamless blended result)
Another example on the TED-talks dataset.
[GIFs] Input Sequence 1 + Input Sequence 2 → Direct Playback (notice the temporal jitter in the middle) vs. Blended Playback (our seamless blended result)
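To make the difference concrete, below is a minimal sketch of the two playback modes around a transition point. The frame arrays, the window size, and the per-pixel crossfade are illustrative assumptions only; in our system the naive crossfade is replaced by the learned pose-aware blending network.

```python
import numpy as np

def direct_playback(seq1, seq2, cut1, cut2):
    """Hard cut: jump from frame cut1 of seq1 to frame cut2 of seq2.
    Any pose mismatch at the cut appears as temporal jitter."""
    return np.concatenate([seq1[:cut1], seq2[cut2:]], axis=0)

def blended_playback(seq1, seq2, cut1, cut2, window=5):
    """Replace the frames inside a window around the transition with
    blended frames; here a naive per-pixel crossfade stands in for the
    learned pose-aware blending network used in the paper."""
    a = seq1[cut1 - window:cut1 + window].astype(np.float32)
    b = seq2[cut2 - window:cut2 + window].astype(np.float32)
    alphas = np.linspace(0.0, 1.0, 2 * window)[:, None, None, None]
    blended = ((1.0 - alphas) * a + alphas * b).astype(seq1.dtype)
    return np.concatenate([seq1[:cut1 - window], blended,
                           seq2[cut2 + window:]], axis=0)
```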
3. Alternative Transitions on the Video Motion Graph
Given the same input sequence, we show alternative transitions searched by the proposed video motion graph (see Sec. 3.1).
[GIFs] Same Input Sequence A | Searched Alternative Sequence B1 | Our Synthesized Blended Playback A → B1
[GIFs] Same Input Sequence A | Searched Alternative Sequence B2 | Our Synthesized Blended Playback A → B2
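As a rough illustration of how such alternative continuations can be enumerated, the sketch below builds transition edges between pose-similar frames. The per-frame pose vectors, the L2 pose distance, and the threshold are assumptions for illustration; the actual graph construction follows Sec. 3.1 of the main paper.

```python
import numpy as np

def build_transition_edges(poses, threshold=0.1, min_gap=10):
    """Connect frame i to frame j when their poses are similar enough
    that playback can jump i -> j without a visible discontinuity.
    `poses` is one pose vector per video frame (an assumption here)."""
    n = len(poses)
    edges = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(n):
            # skip trivial neighbors; keep only genuine jumps
            if abs(i - j) > min_gap and \
               np.linalg.norm(poses[i] - poses[j]) < threshold:
                edges[i].append(j)
    return edges

# Given the same input sequence A ending at frame t, every edge t -> j
# starts an alternative continuation (the sequences B1, B2 shown above).
```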
4. Alternative Reenacted Results from Beam Search
Given a reference video and a target audio, our system can synthesize alternative reenacted results from different paths in the motion graph. Specifically, we select the K most plausible paths found by the beam search described in Sec. 3.3.
[Videos] Reference Video + Reference Audio → Results from Searched Paths 1–6
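The sketch below illustrates, under simplifying assumptions, how beam search can keep the K most plausible paths: `edges` is a frame-transition graph as in the previous sketch, and `score` is a hypothetical stand-in for the audio-motion matching cost used in Sec. 3.3.

```python
import heapq

def beam_search_paths(edges, start, num_frames, score, beam_width=6):
    """Grow paths frame by frame, keeping only the `beam_width` best
    partial paths at each step (lower score = more plausible)."""
    beam = [(0.0, [start])]
    for _ in range(num_frames - 1):
        candidates = []
        for _, path in beam:
            # successors: the next frame in the source video, plus any
            # graph transition out of the current frame
            # (frame-count bounds checks omitted for brevity)
            for nxt in [path[-1] + 1] + edges.get(path[-1], []):
                new_path = path + [nxt]
                candidates.append((score(new_path), new_path))
        beam = heapq.nsmallest(beam_width, candidates, key=lambda c: c[0])
    # the K = beam_width alternative reenactments shown above
    return [path for _, path in beam]
```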
5. Comparison: Ours vs. GT
Below we show several videos, each either one of our reenacted results or a ground-truth video; on the webpage, clicking a video reveals the answer.
[3×3 grid of videos; click each to show the answer]
6. Comparison: Ours vs. Other SOTA Methods for Video Synthesis
Here we show the video results for Fig. 6 in the main paper. The videos have no sound.
Example 1: [videos] GT, FeatureFlow [25], SuperSloMo [32] (top row); Ours, vUnet [22], EBDance [16] (bottom row)
Example 2: [videos] GT, FeatureFlow [25], SuperSloMo [32] (top row); Ours, vUnet [22], EBDance [16] (bottom row)
7. Comparison: Blended Playback by Ours vs. Other SOTA Methods
We compare our pose-aware blending network with other state-of-the-art methods for synthesizing blended frames around transition points (shown as GIF results played in 3x slow motion).
[GIFs] Input Sequence 1 + Input Sequence 2 → FeatureFlow [25] | vUnet [22] | SuperSloMo [32] | EBDance [16] | Ours
8. Limitation
Our video blending network can blend the foreground human poses and slight background changes, but it fails when the background changes considerably.
[GIFs] Input Sequence 1 + Input Sequence 2 → Blended Playback (limitation: the background cannot be blended smoothly when it changes considerably)
In addition, our video blending results are affected by the off-the-shelf motion capture method [72]. If the captured SMPL model is inaccurate, e.g., the limbs or fingers are not well aligned with the image, our blending network, which warps the foreground image based on the captured mesh, may not achieve the best performance.
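To illustrate why capture errors propagate into the blended frames, below is a minimal sketch of dense backward warping of a frame by a flow field derived from mesh correspondences. `flow_from_meshes` is a hypothetical helper, not part of [72] or our pipeline, and nearest-neighbor sampling is a simplification.

```python
import numpy as np

def warp_foreground(frame, flow):
    """Backward-warp `frame` (H, W, C) with a dense flow field (H, W, 2)
    using nearest-neighbor sampling; errors in `flow` (i.e., in the
    captured mesh) directly displace the warped foreground pixels."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip((xs + flow[..., 0]).round().astype(int), 0, w - 1)
    src_y = np.clip((ys + flow[..., 1]).round().astype(int), 0, h - 1)
    return frame[src_y, src_x]

# flow = flow_from_meshes(mesh_t, mesh_blend, camera)  # hypothetical helper
# warped = warp_foreground(frame_t, flow)
```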
References

[16] Caroline Chan, Shiry Ginosar, Tinghui Zhou, and Alexei A. Efros. Everybody dance now. In Proc. ICCV, 2019.
[22] Patrick Esser, Ekaterina Sutter, and Björn Ommer. A variational U-Net for conditional appearance and shape generation. In Proc. CVPR, 2018.
[25] Shurui Gui, Chaoyue Wang, Qihua Chen, and Dacheng Tao. FeatureFlow: Robust video interpolation via structure-to-texture generation. In Proc. CVPR, 2020.
[32] Huaizu Jiang, Deqing Sun, Varun Jampani, Ming-Hsuan Yang, Erik Learned-Miller, and Jan Kautz. Super SloMo: High quality estimation of multiple intermediate frames for video interpolation. In Proc. CVPR, 2018.
[72] Donglai Xiang, Hanbyul Joo, and Yaser Sheikh. Monocular total capture: Posing face, body, and hands in the wild. In Proc. CVPR, 2019.

Note: reference indices are the same as those used in the main paper.