Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction

Tobias Hallmen, Fabian Deuser, Norbert Oswald, Elisabeth André; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 4657-4665

Abstract


In this research, we introduce a novel methodology for assessing Emotional Mimicry Intensity (EMI) as part of the 6th Workshop and Competition on Affective Behavior Analysis in-the-wild. Our methodology utilises the Wav2Vec 2.0 architecture, which has been pre-trained on an extensive podcast dataset, to capture a wide array of audio features that include both linguistic and paralinguistic components. We refine our feature extraction process by employing a fusion technique that combines individual features with a global mean vector, thereby embedding a broader contextual understanding into our analysis. A key aspect of our approach is the multi-task fusion strategy, which not only leverages these features but also incorporates a pre-trained Valence-Arousal-Dominance (VAD) model. This integration is designed to refine emotion intensity prediction by concurrently processing multiple emotional dimensions, thereby embedding a richer contextual understanding into our framework. For the temporal analysis of the audio data, our feature fusion process utilises a Long Short-Term Memory (LSTM) network. This approach, which relies solely on the provided audio data, shows marked advancements over the existing baseline, offering a more comprehensive understanding of emotional mimicry in naturalistic settings and achieving second place in the EMI challenge.
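As a rough illustration of the pipeline described in the abstract, the sketch below shows how per-frame Wav2Vec 2.0 features might be fused with a global mean vector, modelled over time with an LSTM, and fed to multi-task heads for EMI and VAD prediction. All module names, feature dimensions, and the number of output targets are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class EmiFusionSketch(nn.Module):
    """Illustrative sketch (not the authors' code): fuse per-frame
    Wav2Vec 2.0 features with a global mean vector, model time with an
    LSTM, and predict EMI intensities and VAD values as separate tasks."""

    def __init__(self, feat_dim=1024, hidden=256, n_emotions=6):
        super().__init__()
        # Temporal model over the fused (frame + global-mean) features.
        self.lstm = nn.LSTM(input_size=2 * feat_dim, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        # Task heads: EMI intensities (assumed 6 targets) and VAD (3 values).
        self.emi_head = nn.Linear(2 * hidden, n_emotions)
        self.vad_head = nn.Linear(2 * hidden, 3)

    def forward(self, wav2vec_feats):
        # wav2vec_feats: (batch, time, feat_dim) frame-level features,
        # assumed to be extracted beforehand by a pre-trained Wav2Vec 2.0 model.
        global_mean = wav2vec_feats.mean(dim=1, keepdim=True)            # (B, 1, D)
        fused = torch.cat(
            [wav2vec_feats, global_mean.expand_as(wav2vec_feats)], dim=-1)
        seq, _ = self.lstm(fused)                                        # (B, T, 2H)
        pooled = seq.mean(dim=1)                                         # clip-level summary
        return torch.sigmoid(self.emi_head(pooled)), self.vad_head(pooled)


# Usage with dummy features: 2 clips, 300 frames, 1024-dim features each.
if __name__ == "__main__":
    model = EmiFusionSketch()
    emi, vad = model(torch.randn(2, 300, 1024))
    print(emi.shape, vad.shape)  # torch.Size([2, 6]) torch.Size([2, 3])
```

The concatenation of each frame with the clip-wide mean is one plausible reading of "combines individual features with a global mean vector"; the actual fusion and the way the pre-trained VAD model is incorporated may differ in the paper.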

Related Material


[pdf]
[bibtex]
@InProceedings{Hallmen_2024_CVPR,
    author    = {Hallmen, Tobias and Deuser, Fabian and Oswald, Norbert and Andr\'e, Elisabeth},
    title     = {Unimodal Multi-Task Fusion for Emotional Mimicry Intensity Prediction},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {4657-4665}
}