On the Audio-visual Synchronization for Lip-to-Speech Synthesis

Zhe Niu, Brian Mak; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 7843-7852

Abstract


Most lip-to-speech (LTS) synthesis models are trained and evaluated under the assumption that the audio-video pairs in the dataset are well synchronized. In this work, we demonstrate that commonly used audio-visual datasets, such as GRID, TCD-TIMIT, and Lip2Wav, can nevertheless suffer from data asynchrony, which leads to inaccurate evaluation with conventional time-alignment-sensitive metrics such as STOI, ESTOI, and MCD. Moreover, training an LTS model on such datasets can result in model asynchrony, meaning that the generated speech and the input video are out of sync. To address these problems, we first provide a time-alignment frontend for the commonly used metrics to ensure accurate evaluation. We then propose a synchronized lip-to-speech (SLTS) model with an automatic synchronization mechanism (ASM) that corrects data asynchrony and penalizes model asynchrony during training. We evaluate the effectiveness of our approach on both artificial and popular audio-visual datasets. Our proposed method outperforms existing state-of-the-art models on a variety of evaluation metrics.
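To make the idea of a time-alignment frontend concrete, below is a minimal sketch: before an alignment-sensitive metric is computed, the synthesized waveform is globally shifted to best match the reference via a cross-correlation search, and both signals are cropped to their aligned overlap. The function name align_by_xcorr, the max_shift bound, and the use of a single global shift are illustrative assumptions, not the paper's exact frontend.

import numpy as np

def align_by_xcorr(ref, syn, max_shift=1600):
    """Globally align a synthesized waveform `syn` to a reference `ref`.

    The best global shift within +/- `max_shift` samples (100 ms at 16 kHz
    by default) is taken at the cross-correlation peak, and both signals
    are cropped to their aligned overlap.
    """
    n = min(len(ref), len(syn))
    ref = np.asarray(ref, dtype=np.float64)[:n]
    syn = np.asarray(syn, dtype=np.float64)[:n]

    # Full cross-correlation; output index i corresponds to lag i - (n - 1).
    xcorr = np.correlate(ref, syn, mode="full")
    lags = np.arange(-n + 1, n)
    window = np.abs(lags) <= max_shift
    best_lag = lags[window][np.argmax(xcorr[window])]

    # Positive lag: `ref` is delayed relative to `syn`; drop its leading samples.
    if best_lag >= 0:
        ref_a, syn_a = ref[best_lag:], syn[:n - best_lag]
    else:
        ref_a, syn_a = ref[:n + best_lag], syn[-best_lag:]
    return ref_a, syn_a

# The aligned pair (ref_a, syn_a) can then be passed to standard STOI/ESTOI/MCD
# implementations in place of the raw, possibly misaligned waveforms.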

Related Material


[bibtex]
@InProceedings{Niu_2023_ICCV,
    author    = {Niu, Zhe and Mak, Brian},
    title     = {On the Audio-visual Synchronization for Lip-to-Speech Synthesis},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {7843-7852}
}