DiVAS: Video and Audio Synchronization with Dynamic Frame Rates

Clara Fernandez-Labrador, Mertcan Akçay, Eitan Abecassis, Joan Massich, Christopher Schroers; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26846-26854

Abstract


Synchronization issues between audio and video are among the most disturbing quality defects in film production and live broadcasting. Even a discrepancy as short as 45 milliseconds can degrade the viewer's experience enough to warrant manual quality checks over entire movies. In this paper, we study the automatic discovery of such issues. Specifically, we focus on the alignment of lip movements with spoken words, targeting realistic production scenarios which can include background noise and music, intricate head poses, excessive makeup, or scenes with multiple individuals where the speaker is unknown. Our model's robustness also extends to various media specifications, including different video frame rates and audio sample rates. To address these challenges, we present a model fully based on transformers that encodes face crops or full video frames and raw audio using timestamp information, identifies the speaker, and provides highly accurate synchronization predictions much faster than previous methods.
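The sketch below is not the authors' implementation; it only illustrates, under assumed module names and dimensions, the high-level idea the abstract describes: encode per-frame video features and per-window audio features with small transformers, inject absolute timestamps (in seconds) so arbitrary frame rates and sample rates share one time axis, and estimate the audio-video offset from cross-modal similarity. The correlation-based offset search and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch (not the DiVAS model): timestamp-conditioned transformer encoders
# for video and audio features, plus a simple similarity-based offset search.
import torch
import torch.nn as nn


def time_encoding(timestamps: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal encoding of absolute timestamps in seconds, (T,) -> (T, dim)."""
    freqs = torch.exp(torch.linspace(0.0, 8.0, dim // 2))  # assumed frequency range
    angles = timestamps[:, None] * freqs[None, :]
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)


class ModalityEncoder(nn.Module):
    """Projects per-step features, adds timestamp encodings, runs a small transformer."""

    def __init__(self, in_dim: int, dim: int = 128, layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(in_dim, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.dim = dim

    def forward(self, feats: torch.Tensor, timestamps: torch.Tensor) -> torch.Tensor:
        x = self.proj(feats) + time_encoding(timestamps, self.dim)
        return self.encoder(x.unsqueeze(0)).squeeze(0)  # (T, dim)


def estimate_offset(video_emb, video_ts, audio_emb, audio_ts, max_shift_s=0.5, step_s=0.01):
    """Grid-search the audio time shift that maximizes mean cosine similarity
    between each video frame and its nearest (shifted) audio window."""
    v = nn.functional.normalize(video_emb, dim=-1)
    a = nn.functional.normalize(audio_emb, dim=-1)
    best_shift, best_score = 0.0, -1.0
    for shift in torch.arange(-max_shift_s, max_shift_s + 1e-9, step_s):
        # for every video timestamp, find the closest shifted audio timestamp
        idx = torch.argmin((audio_ts[None, :] + shift - video_ts[:, None]).abs(), dim=1)
        score = (v * a[idx]).sum(-1).mean().item()
        if score > best_score:
            best_score, best_shift = score, float(shift)
    return best_shift


if __name__ == "__main__":
    # dummy clip: 25 fps video features (e.g. face-crop embeddings) and 100 Hz audio features
    video_ts = torch.arange(0, 2.0, 1 / 25)
    audio_ts = torch.arange(0, 2.0, 1 / 100)
    video_feats = torch.randn(len(video_ts), 512)
    audio_feats = torch.randn(len(audio_ts), 80)

    offset = estimate_offset(ModalityEncoder(512)(video_feats, video_ts), video_ts,
                             ModalityEncoder(80)(audio_feats, audio_ts), audio_ts)
    print(f"estimated A/V offset: {offset * 1000:.0f} ms")
```

Because both encoders consume absolute timestamps rather than fixed sample indices, the same offset search works unchanged for any combination of video frame rate and audio sample rate, which is the property the abstract emphasizes as "dynamic frame rates."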

Related Material


[bibtex]
@InProceedings{Fernandez-Labrador_2024_CVPR,
    author    = {Fernandez-Labrador, Clara and Ak\c{c}ay, Mertcan and Abecassis, Eitan and Massich, Joan and Schroers, Christopher},
    title     = {DiVAS: Video and Audio Synchronization with Dynamic Frame Rates},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {26846-26854}
}