Interactive Multimodal Framework with Temporal Modeling for Emotion Recognition

Jun Yu, Yongqi Wang, Lei Wang, Yang Zheng, Shengfan Xu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 5708-5715

Abstract


This paper presents our first-place method for valence-arousal (VA) estimation in the 8th Affective Behavior Analysis in-the-Wild (ABAW) competition. Our approach integrates visual and audio information in a multimodal framework. The visual branch uses a pre-trained ResNet model to extract spatial features from facial images, while the audio branches use pre-trained VGG models to extract VGGish and LogMel features from speech signals. The extracted features are temporally modeled by Temporal Convolutional Networks (TCNs). Cross-modal attention then allows the visual features to interact with the audio features through query-key-value attention, and the resulting features are concatenated and passed through a regression layer to predict valence and arousal. Our method performs strongly on the Aff-Wild2 dataset and achieves first place in the ABAW VA track, demonstrating accurate and robust multimodal fusion for VA estimation in unconstrained, real-world emotion recognition scenarios.
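The abstract describes a pipeline of per-modality TCNs, cross-modal attention in which visual features act as queries over audio keys and values, feature concatenation, and a regression head. The PyTorch sketch below illustrates that structure only; all module names, layer counts, and dimensions (e.g., VAEstimator, TemporalBlock, hidden=256) are illustrative assumptions and not the authors' implementation.

    # Illustrative sketch of the pipeline described in the abstract:
    # per-modality TCNs -> cross-modal attention (visual as query, audio as
    # key/value) -> concatenation -> regression to valence/arousal.
    # All dimensions, layer counts, and module names are assumptions.
    import torch
    import torch.nn as nn

    class TemporalBlock(nn.Module):
        """One dilated 1-D convolution block of a simple TCN."""
        def __init__(self, channels, kernel_size=3, dilation=1):
            super().__init__()
            padding = (kernel_size - 1) * dilation // 2
            self.conv = nn.Conv1d(channels, channels, kernel_size,
                                  padding=padding, dilation=dilation)
            self.act = nn.ReLU()

        def forward(self, x):  # x: (batch, channels, time)
            return self.act(self.conv(x)) + x

    class VAEstimator(nn.Module):
        def __init__(self, vis_dim=512, aud_dim=128, hidden=256, heads=4):
            super().__init__()
            self.vis_proj = nn.Linear(vis_dim, hidden)
            self.aud_proj = nn.Linear(aud_dim, hidden)
            self.vis_tcn = nn.Sequential(*[TemporalBlock(hidden, dilation=2 ** i)
                                           for i in range(3)])
            self.aud_tcn = nn.Sequential(*[TemporalBlock(hidden, dilation=2 ** i)
                                           for i in range(3)])
            # Visual features act as queries; audio features supply keys/values.
            self.cross_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
            self.regressor = nn.Sequential(nn.Linear(2 * hidden, hidden),
                                           nn.ReLU(),
                                           nn.Linear(hidden, 2))  # valence, arousal

        def forward(self, vis_feats, aud_feats):
            # vis_feats: (batch, time, vis_dim), e.g. per-frame ResNet features
            # aud_feats: (batch, time, aud_dim), e.g. per-frame VGGish/LogMel features
            v = self.vis_proj(vis_feats)
            a = self.aud_proj(aud_feats)
            v = self.vis_tcn(v.transpose(1, 2)).transpose(1, 2)
            a = self.aud_tcn(a.transpose(1, 2)).transpose(1, 2)
            fused, _ = self.cross_attn(query=v, key=a, value=a)
            return self.regressor(torch.cat([v, fused], dim=-1))  # (batch, time, 2)

In this sketch the attention output is concatenated with the TCN-processed visual stream before regression; whether the authors concatenate raw, TCN-level, or attention-level features is not specified in the abstract, so that choice is an assumption.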

Related Material


[bibtex]
@InProceedings{Yu_2025_CVPR,
    author    = {Yu, Jun and Wang, Yongqi and Wang, Lei and Zheng, Yang and Xu, Shengfan},
    title     = {Interactive Multimodal Framework with Temporal Modeling for Emotion Recognition},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {5708-5715}
}