Multi-modal Arousal and Valence Estimation under Noisy Conditions

Denis Dresvyanskiy, Maxim Markitantov, Jiawei Yu, Heysem Kaya, Alexey Karpov; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 4773-4783

Abstract


Automatic emotion recognition has gained significant attention over the past two decades due to the central role that emotions play in human communication. While multi-modal systems demonstrate high performance on laboratory-controlled data, their validity on non-lab-controlled, namely 'in-the-wild', data remains a challenge. This work investigates audio-visual deep learning approaches for emotion recognition in-the-wild, with a particular focus on the effectiveness of architectures based on fine-tuned Convolutional Neural Networks (CNNs) and the Public Dimensional Emotion Model (PDEM) for the video and audio modalities, respectively. We explore and compare various temporal modeling techniques (e.g., transformer architectures) and fusion strategies, leveraging the embeddings from the developed multi-stage trained modality-specific Deep Neural Networks (DNNs). The results are reported on the AffWild2 dataset following the Affective Behavior Analysis in-the-Wild 2024 (ABAW'24) challenge protocol. Our investigation highlights the complexities of robust multi-modal emotion recognition in unconstrained environments, providing insights into the use of various deep learning architectures for tackling this challenging task.
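To make the pipeline described above concrete, the following is a minimal sketch of transformer-based temporal modeling over fused audio-visual embeddings for valence/arousal regression. It is not the authors' reported architecture: the embedding dimensions (512 for CNN visual features, 1024 for PDEM audio features), sequence length, concatenation-based fusion, and all hyperparameters are illustrative assumptions.

```python
# Sketch only: per-frame visual embeddings (stand-in for a fine-tuned CNN)
# and audio embeddings (stand-in for PDEM) are projected, concatenated,
# passed through a transformer encoder for temporal modeling, and regressed
# to per-frame valence/arousal. All dimensions are assumptions.
import torch
import torch.nn as nn

class AVTransformerFusion(nn.Module):
    def __init__(self, visual_dim=512, audio_dim=1024, d_model=256,
                 nhead=4, num_layers=2, max_len=100):
        super().__init__()
        # Project each modality into a shared space before fusion.
        self.visual_proj = nn.Linear(visual_dim, d_model)
        self.audio_proj = nn.Linear(audio_dim, d_model)
        # Learned positional embeddings for the fused frame sequence.
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len, 2 * d_model))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=2 * d_model, nhead=nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer,
                                             num_layers=num_layers)
        # Per-frame regression head; tanh keeps outputs in [-1, 1],
        # the valence/arousal range used in the ABAW protocol.
        self.head = nn.Sequential(nn.Linear(2 * d_model, 64), nn.ReLU(),
                                  nn.Linear(64, 2), nn.Tanh())

    def forward(self, visual_emb, audio_emb):
        # visual_emb: (batch, T, visual_dim); audio_emb: (batch, T, audio_dim),
        # assumed pre-aligned to a common frame rate.
        x = torch.cat([self.visual_proj(visual_emb),
                       self.audio_proj(audio_emb)], dim=-1)
        x = self.encoder(x + self.pos_emb[:, :x.size(1)])
        return self.head(x)  # (batch, T, 2): valence and arousal per frame

model = AVTransformerFusion()
va = model(torch.randn(8, 100, 512), torch.randn(8, 100, 1024))
print(va.shape)  # torch.Size([8, 100, 2])
```

Feature-level concatenation is only one of the fusion strategies the paper compares; cross-modal attention or decision-level fusion would slot into the same skeleton by replacing the concatenation step.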

Related Material


@InProceedings{Dresvyanskiy_2024_CVPR,
  author    = {Dresvyanskiy, Denis and Markitantov, Maxim and Yu, Jiawei and Kaya, Heysem and Karpov, Alexey},
  title     = {Multi-modal Arousal and Valence Estimation under Noisy Conditions},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {4773-4783}
}