Audio Transformer for Synthetic Speech Detection via Multi-Formant Analysis

Luca Cuccovillo, Milica Gerhardt, Patrick Aichroth; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 4409-4417

Abstract


This paper introduces a novel multi-task transformer for detecting synthetic speech. The network encodes the magnitude and phase of the input speech with a feature bottleneck, which is used to autoencode the input magnitude, to predict the trajectory of the first phonetic formants (F0, F1, F2), and to distinguish whether the input speech is synthetic or natural. The approach achieves state-of-the-art performance on the ASVspoof 2019 LA dataset, with an AUC score of 0.932, while remaining interpretable.
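As an illustration of the AUC metric quoted in the abstract, the following is a minimal sketch of how an area under the ROC curve can be computed from detector scores. It is not the authors' evaluation code. The function name `roc_auc`, the label convention (1 = synthetic, 0 = natural), and the toy scores are assumptions made for this example only.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Assumed convention (not taken from the paper): label 1 marks
    synthetic speech, label 0 marks natural speech, and a higher
    score means "more likely synthetic".
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    # AUC equals the probability that a randomly chosen positive
    # scores higher than a randomly chosen negative; ties count half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy scores, invented for illustration only.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)
```

Under this definition, a value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of synthetic and natural utterances. The pairwise loop is quadratic in the number of samples, so a rank-based implementation would be preferable for a full evaluation set.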

Related Material


[bibtex]
@InProceedings{Cuccovillo_2024_CVPR,
    author    = {Cuccovillo, Luca and Gerhardt, Milica and Aichroth, Patrick},
    title     = {Audio Transformer for Synthetic Speech Detection via Multi-Formant Analysis},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {4409-4417}
}