Multimodaltrace: Deepfake Detection Using Audiovisual Representation Learning

Raza, Muhammad Anas; Malik, Khalid Mahmood

Muhammad Anas Raza, Khalid Mahmood Malik; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023, pp. 993-1000

Abstract

Deepfakes are created with generative deep learning techniques, with the intent to sow mistrust in society, manipulate public opinion and political decisions, and serve other malicious purposes such as blackmail, scamming, and even cyberstalking. Because a realistic deepfake may involve manipulation of audio, video, or both, it is important to explore whether deepfakes can be detected through the inability of generative algorithms to synchronize the audio and visual modalities. Most well-performing existing methods detect deepfakes from either audio or video cues alone, while a few ensemble the predictions made on each modality without inspecting the relationship between audio and video cues. Deepfake detection using joint audiovisual representation learning remains largely unexplored. This paper therefore proposes a unified multimodal framework, Multimodaltrace, which extracts learned channels from the audio and visual modalities, mixes them independently in an IntrAmodality Mixer Layer (IAML), and processes them jointly in IntErModality Mixer Layers (IEML), whose output is fed to a multilabel classification head. Empirical results show the effectiveness of the proposed framework, which achieves a state-of-the-art accuracy of 92.9% on the FakeAVCeleb dataset. Cross-dataset evaluation of the framework on the World Leaders and Presidential Deepfake Detection datasets gives accuracies of 83.61% and 70%, respectively. The study also provides insight into how the model focuses on different parts of the audio and visual features through integrated gradient analysis.
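The data flow described in the abstract (per-modality mixing, then joint mixing, then a multilabel head) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the MLP-Mixer-style form of each layer, the `tanh` nonlinearity, the token and channel sizes, the mean pooling, and the choice of two output labels (audio fake, video fake) are all assumptions made for illustration, and the names `mixer_layer`, `init_params`, and `forward` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixer_layer(x, w_tokens, w_channels):
    """Assumed mixer-style block: mix across tokens, then across channels.

    x has shape (tokens, channels). Residual connections keep the shape fixed.
    """
    x = x + np.tanh(w_tokens @ x)      # token mixing
    x = x + np.tanh(x @ w_channels)    # channel mixing
    return x

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_audio, n_video, channels):
    """Random weights for illustration only (nothing here is trained)."""
    def pair(n_tokens):
        return (rng.normal(0, 0.1, (n_tokens, n_tokens)),
                rng.normal(0, 0.1, (channels, channels)))
    return {
        "iaml_audio": pair(n_audio),          # intra-modality mixing, audio
        "iaml_video": pair(n_video),          # intra-modality mixing, video
        "ieml": pair(n_audio + n_video),      # inter-modality mixing, joint tokens
        "head": rng.normal(0, 0.1, (channels, 2)),
    }

def forward(audio, video, params):
    # Mix each modality independently.
    a = mixer_layer(audio, *params["iaml_audio"])
    v = mixer_layer(video, *params["iaml_video"])
    # Stack the tokens of both modalities and mix them jointly.
    joint = np.concatenate([a, v], axis=0)
    joint = mixer_layer(joint, *params["ieml"])
    # Multilabel head: one independent sigmoid per label (assumed labels:
    # "audio is fake", "video is fake"), so both can be active at once.
    pooled = joint.mean(axis=0)
    return sigmoid(pooled @ params["head"])

params = init_params(n_audio=4, n_video=6, channels=8)
probs = forward(rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), params)
print(probs.shape)  # (2,)
```

Independent sigmoids, rather than a softmax, are what make the head multilabel in this sketch: the two outputs need not sum to one, which lets a sample be flagged as manipulated in audio, in video, or in both.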

Related Material

[bibtex]

@InProceedings{Raza_2023_CVPR,
    author    = {Raza, Muhammad Anas and Malik, Khalid Mahmood},
    title     = {Multimodaltrace: Deepfake Detection Using Audiovisual Representation Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2023},
    pages     = {993-1000}
}