Fake It to Make It: Using Synthetic Data to Remedy the Data Shortage in Joint Multimodal Speech-and-Gesture Synthesis

Shivam Mehta, Anna Deichler, Jim O'regan, Birger Moëll, Jonas Beskow, Gustav Eje Henter, Simon Alexanderson; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 1952-1964

Abstract


Although humans engaged in face-to-face conversation simultaneously communicate both verbally and non-verbally, methods for joint and unified synthesis of speech audio and co-speech 3D gesture motion from text are a new and emerging field. These technologies hold great promise for more human-like, efficient, expressive, and robust synthetic communication, but are currently held back by the lack of suitably large datasets, as existing methods are trained on parallel data from all constituent modalities. Inspired by student-teacher methods, we propose a straightforward solution to the data shortage: simply synthesising additional training material. Specifically, we use unimodal synthesis models trained on large datasets to create multimodal (but synthetic) parallel training data, and then pre-train a joint synthesis model on that material. In addition, we propose a new synthesis architecture that adds better and more controllable prosody modelling to the state-of-the-art method in the field. Our results confirm that pre-training on large amounts of synthetic data improves the quality of both the speech and the motion synthesised by the multimodal model, with the proposed architecture yielding further benefits when pre-trained on the synthetic data.
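
The data-generation idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: `ParallelExample`, `build_synthetic_corpus`, `toy_tts`, `toy_gesture`, and `training_schedule` are hypothetical names, the two "models" are toy stand-ins for large unimodal synthesisers, and both the choice to drive the gesture model from the synthesised audio and the fine-tuning stage on real data after pre-training are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


@dataclass
class ParallelExample:
    """One aligned (text, audio, motion) training item."""
    text: str
    audio: List[float]   # stand-in for a waveform or acoustic features
    motion: List[float]  # stand-in for a 3D pose sequence
    synthetic: bool      # True if produced by unimodal teacher models


def build_synthetic_corpus(
    texts: Sequence[str],
    tts: Callable[[str], List[float]],
    gesture: Callable[[List[float]], List[float]],
) -> List[ParallelExample]:
    """Create multimodal parallel data from text using two unimodal models."""
    corpus = []
    for text in texts:
        audio = tts(text)        # teacher 1: text -> speech
        motion = gesture(audio)  # teacher 2: speech -> gesture (assumption)
        corpus.append(ParallelExample(text, audio, motion, synthetic=True))
    return corpus


def toy_tts(text: str) -> List[float]:
    """Hypothetical stand-in: one 'audio' value per word (its length)."""
    return [float(len(word)) for word in text.split()]


def toy_gesture(audio: List[float]) -> List[float]:
    """Hypothetical stand-in: one 'pose' value per audio value."""
    return [0.5 * value for value in audio]


def training_schedule(
    synthetic: List[ParallelExample],
    real: List[ParallelExample],
) -> List[Tuple[str, List[ParallelExample]]]:
    """Pre-train on the large synthetic set, then fine-tune on real data."""
    return [("pretrain", synthetic), ("finetune", real)]


if __name__ == "__main__":
    synthetic = build_synthetic_corpus(
        ["hello there", "nice to meet you"], toy_tts, toy_gesture
    )
    for stage, data in training_schedule(synthetic, real=[]):
        print(stage, len(data))
```

The point of the sketch is only the data flow: text is the sole input that needs to be collected at scale, and each synthetic item is aligned across modalities by construction because both outputs derive from the same text.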

Related Material


[bibtex]
@InProceedings{Mehta_2024_CVPR,
    author    = {Mehta, Shivam and Deichler, Anna and O'regan, Jim and Mo\"ell, Birger and Beskow, Jonas and Henter, Gustav Eje and Alexanderson, Simon},
    title     = {Fake It to Make It: Using Synthetic Data to Remedy the Data Shortage in Joint Multimodal Speech-and-Gesture Synthesis},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {1952-1964}
}