Perceptual Synchronization Scoring of Dubbed Content Using Phoneme-Viseme Agreement

Honey Gupta; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2024, pp. 392-402

Abstract


Recent works have shown great success in synchronizing lip movements in a given video with a dubbed audio stream. However, comparisons of the synchronization capabilities of these methods remain weakly substantiated due to the lack of a generalized, visually grounded evaluation method. This work proposes a simple and grounded algorithm, PhoVis, that measures synchronization and the perceived quality of a dubbed video at the utterance level. The approach generates expected visemes by considering a speaker's lip-pose history and the phonemes in the dubbed audio. A sync distance and a perceptual score are then derived by comparing the generated visemes with the clip's visemes using spatially grounded pose distances. PhoVis is built upon the most basic audio-video elements, i.e., phonemes and visemes, to compute agreement, which makes it a domain-independent algorithm that can score both original and lip-synthesized videos, allowing measurement of dubbing as well as video-synthesis quality. We demonstrate that PhoVis achieves better generalization across languages, is aptly tailored for lip-sync measurement, and measures audio-lip correlation better than existing AV sync methods.
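The core idea, phoneme-viseme agreement scored through pose distances, can be sketched in a few lines. The following is a minimal illustrative sketch only, not the paper's PhoVis implementation: the phoneme-to-viseme mapping, the `toy_pose_distance` function, and all names are hypothetical stand-ins for the paper's learned viseme generator and spatially grounded pose distances.

```python
# Toy many-to-one phoneme -> viseme-class mapping (illustrative only;
# the paper generates expected visemes from lip-pose history + phonemes).
PHONEME_TO_VISEME = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",
    "f": "labiodental", "v": "labiodental",
    "aa": "open", "ae": "open",
    "uw": "rounded", "ow": "rounded",
}

def expected_visemes(phonemes):
    """Map each phoneme in the dubbed audio to an expected viseme class."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

def sync_distance(expected, observed, pose_distance):
    """Mean pose distance between expected and observed visemes:
    lower means better audio-lip agreement."""
    assert len(expected) == len(observed)
    return sum(pose_distance(e, o) for e, o in zip(expected, observed)) / len(expected)

def toy_pose_distance(a, b):
    """Hypothetical 0/1 distance; the paper uses spatially grounded
    lip-pose distances rather than class equality."""
    return 0.0 if a == b else 1.0

phonemes = ["p", "aa", "uw", "f"]                      # from the dubbed audio
observed = ["bilabial", "open", "rounded", "neutral"]  # from the video frames
score = sync_distance(expected_visemes(phonemes), observed, toy_pose_distance)
print(score)  # 0.25: one of four frames disagrees with the expected viseme
```

A perceptual score could then be derived by normalizing or thresholding this distance per utterance; the exact derivation is specific to the paper.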

Related Material


[bibtex]
@InProceedings{Gupta_2024_WACV,
  author    = {Gupta, Honey},
  title     = {Perceptual Synchronization Scoring of Dubbed Content Using Phoneme-Viseme Agreement},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops},
  month     = {January},
  year      = {2024},
  pages     = {392-402}
}