FT2TF: First-Person Statement Text-To-Talking Face Generation

Xingjian Diao, Ming Cheng, Wayner Barrios, SouYoung Jin; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 4821-4830

Abstract


Talking face generation has gained immense popularity in the computer vision community, with applications including AR/VR, teleconferencing, digital assistants, and avatars. Traditional methods are mainly audio-driven and must contend with the inevitable resource-intensive nature of audio storage and processing. To address this challenge, we propose FT2TF (First-Person Statement Text-To-Talking Face Generation), a novel one-stage end-to-end pipeline for talking face generation driven by first-person statement text. Unlike previous work, our model leverages only visual and textual information, without any other sources (e.g., audio, landmarks, or pose) during inference. Extensive experiments are conducted on the LRS2 and LRS3 datasets, and results on multi-dimensional evaluation metrics are reported. Both quantitative and qualitative results show that FT2TF outperforms existing relevant methods and achieves state-of-the-art performance. This highlights our model's capability to bridge first-person statements and dynamic face generation, providing insightful guidance for future work.
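
To make the pipeline's interface concrete, below is a minimal, hypothetical PyTorch sketch of a one-stage text-plus-visual talking face generator in the spirit described by the abstract: a text encoder for the first-person statement, a visual encoder for a reference face, cross-attention fusion, and a frame decoder. All module names (TextEncoder, VisualEncoder, FT2TFSketch) and hyperparameters are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of a text+visual-driven talking face pipeline.
# Not the FT2TF implementation; architecture details are assumptions.
import torch
import torch.nn as nn


class TextEncoder(nn.Module):
    """Embeds first-person statement tokens into a feature sequence."""
    def __init__(self, vocab_size=30522, dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, token_ids):                     # (B, T_text)
        return self.encoder(self.embed(token_ids))    # (B, T_text, dim)


class VisualEncoder(nn.Module):
    """Encodes a reference face image into a global identity feature."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, dim),
        )

    def forward(self, ref_img):                       # (B, 3, H, W)
        return self.net(ref_img)                      # (B, dim)


class FT2TFSketch(nn.Module):
    """Fuses identity and text features, then decodes a short frame sequence."""
    def __init__(self, dim=256, num_frames=16, frame_size=64):
        super().__init__()
        self.text_enc = TextEncoder(dim=dim)
        self.vis_enc = VisualEncoder(dim=dim)
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.queries = nn.Parameter(torch.randn(num_frames, dim))
        self.to_frame = nn.Linear(dim, 3 * frame_size * frame_size)
        self.num_frames, self.frame_size = num_frames, frame_size

    def forward(self, token_ids, ref_img):
        txt = self.text_enc(token_ids)                # (B, T, dim)
        idn = self.vis_enc(ref_img).unsqueeze(1)      # (B, 1, dim)
        # One identity-conditioned query per output frame, cross-attending to text.
        q = self.queries.unsqueeze(0).expand(ref_img.size(0), -1, -1) + idn
        fused, _ = self.fuse(q, txt, txt)             # (B, F, dim)
        frames = self.to_frame(fused)                 # (B, F, 3*H*W)
        return frames.view(-1, self.num_frames, 3,
                           self.frame_size, self.frame_size)


if __name__ == "__main__":
    model = FT2TFSketch()
    tokens = torch.randint(0, 30522, (2, 12))   # dummy first-person statement
    ref = torch.rand(2, 3, 64, 64)               # dummy reference face frame
    print(model(tokens, ref).shape)              # torch.Size([2, 16, 3, 64, 64])

Note that, consistent with the abstract, no audio, landmark, or pose input appears at inference time in this sketch; the only conditioning signals are the statement tokens and the reference frame.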

Related Material


[bibtex]
@InProceedings{Diao_2025_WACV,
    author    = {Diao, Xingjian and Cheng, Ming and Barrios, Wayner and Jin, SouYoung},
    title     = {FT2TF: First-Person Statement Text-To-Talking Face Generation},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {4821-4830}
}