Lost in Translation: Lip-Sync Deepfake Detection from Audio-Video Mismatch

Matyas Bohacek, Hany Farid; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 4315-4323

Abstract


Highly realistic voice cloning combined with AI-powered video manipulation allows for the creation of compelling lip-sync deepfakes in which anyone can be made to say things they never did. The resulting fakes are being used for entertainment but also for everything from election-related disinformation to small- and large-scale fraud. Lip-sync deepfakes can be particularly difficult to detect because only the mouth and jaw of the person talking are modified. We describe a robust and general-purpose technique to detect these fakes. The technique begins by independently translating the audio (using audio-to-text transcription) and the video (using automated lip-reading). We then show that the resulting transcriptions are significantly more mismatched for lip-sync deepfakes than for authentic videos. The robustness of this technique is evaluated against a controlled dataset of our creation and against in-the-wild fakes, all of varying length and resolution.
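
The sketch below illustrates the transcript-mismatch idea described above, not the authors' actual implementation. It assumes openai-whisper as an example audio-to-text transcriber, a placeholder lip_read_video function standing in for an automated lip-reading (visual speech recognition) model, and an illustrative mismatch threshold of 0.5 that is not a value reported in the paper.

```python
# Minimal sketch: flag a video as a likely lip-sync deepfake when the audio
# transcript and the lip-read transcript diverge. The lip-reading step is a
# placeholder; the threshold is illustrative only.

import whisper  # pip install openai-whisper


def transcribe_audio(audio_path: str) -> str:
    """Transcribe the audio track with a speech-to-text model (Whisper here as an example)."""
    model = whisper.load_model("base")
    result = model.transcribe(audio_path)
    return result["text"].lower().strip()


def lip_read_video(video_path: str) -> str:
    """Placeholder for an automated lip-reading model producing a text transcript."""
    raise NotImplementedError("plug in a visual speech recognition model here")


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between two transcripts, normalized by transcript length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


def is_lip_sync_fake(audio_path: str, video_path: str, threshold: float = 0.5) -> bool:
    """Compare the two independent transcripts; a large mismatch suggests a lip-sync fake."""
    audio_text = transcribe_audio(audio_path)
    video_text = lip_read_video(video_path)
    return word_error_rate(audio_text, video_text) > threshold
```

In practice the decision threshold would be chosen from the separation between authentic and lip-sync videos on a labeled set, rather than fixed a priori as in this sketch.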

Related Material


[pdf]
[bibtex]
@InProceedings{Bohacek_2024_CVPR,
    author    = {Bohacek, Matyas and Farid, Hany},
    title     = {Lost in Translation: Lip-Sync Deepfake Detection from Audio-Video Mismatch},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {4315-4323}
}