DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures

Steven Hogue, Chenxu Zhang, Hamza Daruger, Yapeng Tian, Xiaohu Guo; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 1922-1931

Abstract


Audio-driven talking video generation has advanced significantly, but existing methods often depend on video-to-video translation techniques and traditional generative networks such as GANs, and they typically generate talking heads and co-speech gestures separately, leading to less coherent outputs. Furthermore, the gestures produced by these methods often appear overly smooth or subdued, lacking diversity, and many gesture-centric approaches do not integrate talking head generation. To address these limitations, we introduce DiffTED, a new approach for one-shot audio-driven TED-style talking video generation from a single image. Specifically, we leverage a diffusion model to generate sequences of keypoints for a Thin-Plate Spline motion model, precisely controlling the avatar's animation while ensuring temporally coherent and diverse gestures. This innovative approach utilizes classifier-free guidance, empowering the gestures to flow naturally with the audio input without relying on pre-trained classifiers. Experiments demonstrate that DiffTED generates temporally coherent talking videos with diverse co-speech gestures.
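
To make the sampling idea concrete, below is a minimal sketch of classifier-free guidance for an audio-conditioned keypoint diffusion model, where the sampled per-frame keypoints could drive a Thin-Plate Spline motion model. The network architecture (KeypointDenoiser), tensor shapes, noise schedule, and keypoint count are illustrative assumptions, not the authors' released implementation.

    # Hedged sketch: audio-conditioned keypoint diffusion with classifier-free guidance.
    # All names, shapes, and hyperparameters here are assumptions for illustration.
    import torch
    import torch.nn as nn

    class KeypointDenoiser(nn.Module):
        """Toy denoiser: predicts noise on a (batch, frames, n_kp*2) keypoint sequence."""
        def __init__(self, n_kp=50, audio_dim=128, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_kp * 2 + audio_dim + 1, hidden),
                nn.SiLU(),
                nn.Linear(hidden, n_kp * 2),
            )

        def forward(self, x_t, t, audio):
            # Broadcast the per-batch timestep over frames and concatenate with audio features.
            t_feat = t.float().view(-1, 1, 1).expand(-1, x_t.shape[1], 1)
            return self.net(torch.cat([x_t, audio, t_feat], dim=-1))

    @torch.no_grad()
    def cfg_sample(model, audio, n_kp=50, steps=50, guidance_w=2.0):
        """DDPM-style ancestral sampling with classifier-free guidance."""
        B, F, _ = audio.shape
        betas = torch.linspace(1e-4, 0.02, steps)
        alphas = 1.0 - betas
        alpha_bar = torch.cumprod(alphas, dim=0)

        x = torch.randn(B, F, n_kp * 2)       # start from pure noise
        null_audio = torch.zeros_like(audio)  # stand-in "unconditional" audio input

        for i in reversed(range(steps)):
            t = torch.full((B,), i, dtype=torch.long)
            eps_cond = model(x, t, audio)
            eps_uncond = model(x, t, null_audio)
            # Classifier-free guidance: extrapolate toward the audio-conditioned prediction.
            eps = (1.0 + guidance_w) * eps_cond - guidance_w * eps_uncond

            coef = betas[i] / torch.sqrt(1.0 - alpha_bar[i])
            mean = (x - coef * eps) / torch.sqrt(alphas[i])
            noise = torch.randn_like(x) if i > 0 else torch.zeros_like(x)
            x = mean + torch.sqrt(betas[i]) * noise

        # Reshape to per-frame (x, y) keypoints for a TPS-style motion model.
        return x.view(B, F, n_kp, 2)

    # Usage: per-frame audio features drive the sampled keypoint trajectories.
    audio_feats = torch.randn(1, 64, 128)                  # (batch, frames, audio_dim)
    model = KeypointDenoiser(n_kp=50, audio_dim=128)
    keypoints = cfg_sample(model, audio_feats, n_kp=50)
    print(keypoints.shape)                                 # torch.Size([1, 64, 50, 2])

The guidance weight trades off fidelity to the audio condition against sample diversity; no separately trained classifier is needed, since the same denoiser is queried with and without the audio condition.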

Related Material


[bibtex]
@InProceedings{Hogue_2024_CVPR,
  author    = {Hogue, Steven and Zhang, Chenxu and Daruger, Hamza and Tian, Yapeng and Guo, Xiaohu},
  title     = {DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {1922-1931}
}