MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network

Vrushank Ahire, Kunal Shah, Mudasir Khan, Nikhil Pakhale, Lownish Sookha, Mudasir Ganaie, Abhinav Dhall; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 5835-5845

Abstract

Dynamic emotion recognition in the wild remains challenging due to the transient nature of emotional expressions and the temporal misalignment of multi-modal cues. Traditional approaches predict valence and arousal independently and often overlook the inherent correlation between these two dimensions. The proposed Multi-modal Attention for Valence-Arousal Emotion Network (MAVEN) integrates visual, audio, and textual modalities through a bi-directional cross-modal attention mechanism. MAVEN uses modality-specific encoders to extract features from synchronized video frames, audio segments, and transcripts, and predicts emotions in polar coordinates following Russell's circumplex model. On the Aff-Wild2 dataset, MAVEN achieves a concordance correlation coefficient (CCC) of 0.3061, surpassing the ResNet-50 baseline's CCC of 0.22. The multistage architecture captures the subtle and transient nature of emotional expressions in conversational videos and improves emotion recognition in real-world settings.
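
Bi-directional cross-modal attention lets each modality's features attend to another modality's features and vice versa. Below is a minimal PyTorch sketch of this general pattern for two modalities; the class name, dimensions, and residual fusion are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class BiDirectionalCrossModalAttention(nn.Module):
    """Sketch: each modality queries the other; outputs fused with residuals."""
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.a_to_b = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.b_to_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        # feat_a, feat_b: (batch, seq_len, dim) features from two modality encoders
        a_enriched, _ = self.a_to_b(feat_a, feat_b, feat_b)  # A attends to B
        b_enriched, _ = self.b_to_a(feat_b, feat_a, feat_a)  # B attends to A
        return feat_a + a_enriched, feat_b + b_enriched      # residual fusion

For three modalities (visual, audio, text), the same block can be applied pairwise before a final fusion step.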
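
Predicting in polar coordinates reparameterizes the valence-arousal plane of Russell's circumplex model as an intensity (radius) and an emotion angle. The sketch below shows the standard Cartesian-polar conversion; the exact parameterization MAVEN uses is not specified in the abstract and is an assumption here.

import numpy as np

def va_to_polar(valence, arousal):
    # Radius = emotion intensity, angle = position on the circumplex.
    intensity = np.hypot(valence, arousal)
    angle = np.arctan2(arousal, valence)
    return intensity, angle

def polar_to_va(intensity, angle):
    # Inverse mapping back to valence-arousal, e.g. for CCC evaluation.
    return intensity * np.cos(angle), intensity * np.sin(angle)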
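
The reported metric, CCC, rewards both correlation with and calibration to the ground-truth signal, penalizing mean and variance mismatches. A NumPy sketch of the standard definition follows (not the challenge's official evaluation script).

import numpy as np

def ccc(preds: np.ndarray, labels: np.ndarray) -> float:
    # CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))^2)
    mu_p, mu_l = preds.mean(), labels.mean()
    cov = ((preds - mu_p) * (labels - mu_l)).mean()
    return float(2 * cov / (preds.var() + labels.var() + (mu_p - mu_l) ** 2))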

Related Material

@InProceedings{Ahire_2025_CVPR,
    author    = {Ahire, Vrushank and Shah, Kunal and Khan, Mudasir and Pakhale, Nikhil and Sookha, Lownish and Ganaie, Mudasir and Dhall, Abhinav},
    title     = {MAVEN: Multi-modal Attention for Valence-Arousal Emotion Network},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {5835-5845}
}