Detecting Deep-Fake Videos From Aural and Oral Dynamics

Shruti Agarwal, Hany Farid; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021, pp. 981-989

Abstract

A face-swap deep fake replaces a person's face (from eyebrows to chin) with another face. A lip-sync deep fake replaces a person's mouth region with one that is consistent with an impersonated or synthesized audio track. An overlooked aspect in the creation of these deep-fake videos is the human ear. Statically, the shape of the human ear has been shown to provide a biometric signal. Dynamically, movement of the mandible (lower jaw) causes changes in the shape of the ear and ear canal. While the facial identity in a face-swap deep fake may accurately depict the co-opted identity, the ears belong to the original identity. While the mouth in a lip-sync deep fake may be well synchronized with the audio, the dynamics of the ear motion will be decoupled from the mouth and jaw motion. We describe a forensic technique that exploits these static and dynamic aural properties.
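
The paper's implementation is not reproduced here, but the dynamic cue can be illustrated with a minimal sketch: assuming a face tracker that yields a per-frame vertical position for an ear landmark (e.g., the tragus) and a per-frame mouth-opening measurement, one can correlate the two motion signals. In genuine footage, jaw motion perturbs the ear, so the signals should be coupled; in a lip-sync fake they should not. The function names, landmark choice, and the 0.2 decision threshold below are illustrative assumptions, not the authors' method.

import numpy as np

def ear_mouth_correlation(ear_y, mouth_open):
    # Frame-to-frame differences isolate motion from absolute position.
    ear = np.diff(np.asarray(ear_y, dtype=float))
    mouth = np.diff(np.asarray(mouth_open, dtype=float))
    # Standardize both signals; the epsilon guards against zero variance.
    ear = (ear - ear.mean()) / (ear.std() + 1e-8)
    mouth = (mouth - mouth.mean()) / (mouth.std() + 1e-8)
    # The mean product of standardized signals is the Pearson correlation.
    return float(np.mean(ear * mouth))

def flag_lip_sync(ear_y, mouth_open, threshold=0.2):
    # Genuine videos should show coupled ear and mouth motion, so a
    # near-zero correlation is suspicious. The threshold is a
    # hypothetical, uncalibrated value for illustration only.
    return ear_mouth_correlation(ear_y, mouth_open) < threshold

Here, ear_y might come from any per-frame facial-landmark detector and mouth_open from the distance between upper- and lower-lip landmarks; a clearly positive correlation in genuine footage versus a near-zero one in a lip-sync fake would mirror the decoupling described above.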

Related Material

[pdf]
[bibtex]
@InProceedings{Agarwal_2021_CVPR,
    author    = {Agarwal, Shruti and Farid, Hany},
    title     = {Detecting Deep-Fake Videos From Aural and Oral Dynamics},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2021},
    pages     = {981-989}
}