Multimodal Emotion Prediction in Interpersonal Videos Integrating Facial and Speech Cues

Hajer Guerdelli, Claudio Ferrari, Stefano Berretti, Alberto Del Bimbo; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 5727-5736

Abstract


Emotion prediction is essential for affective computing applications, including human-computer interaction and social behavior analysis. In interpersonal settings, accurately predicting emotional states is crucial for modeling social dynamics. We propose a multimodal framework that integrates facial expressions and speech cues to enhance emotion prediction in interpersonal video interactions. Facial features are extracted via a deep attention-based network, while speech is encoded using Wav2Vec 2.0. The resulting multimodal features are modeled temporally using a Long Short-Term Memory (LSTM) network. To adapt the IMEmo dataset for multimodal learning, we introduce a novel speech-feature alignment strategy that ensures synchronization between facial and vocal expressions. Our approach investigates the impact of multimodal fusion on emotion prediction, demonstrating its effectiveness in capturing complex emotional dynamics. Experiments show that our framework improves sentiment classification accuracy by over 17% compared to facial-only baselines. While fine-grained emotion recognition remains challenging, our results highlight the enhanced robustness and generalizability of our method in real-world interpersonal scenarios.
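The fusion pipeline described above (facial features concatenated with Wav2Vec 2.0 speech features, modeled temporally by an LSTM) can be sketched as follows. This is a minimal illustration, not the authors' released code: the feature dimensions (512 for face), the number of output classes, and the late-fusion-by-concatenation design are assumptions for the sketch; 768 matches the Wav2Vec 2.0 base model's hidden size. The actual facial attention network, speech encoder, and the paper's alignment strategy are stood in for by pre-aligned dummy tensors.

```python
import torch
import torch.nn as nn

class MultimodalEmotionLSTM(nn.Module):
    """Sketch of the described pipeline: concatenate time-aligned facial and
    speech features per frame, then model the sequence with an LSTM."""

    def __init__(self, face_dim=512, speech_dim=768, hidden_dim=256, num_classes=3):
        # face_dim and num_classes are illustrative assumptions;
        # speech_dim=768 matches Wav2Vec 2.0 base features.
        super().__init__()
        self.lstm = nn.LSTM(face_dim + speech_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, face_feats, speech_feats):
        # face_feats: (B, T, face_dim); speech_feats: (B, T, speech_dim),
        # assumed already synchronized by the alignment strategy.
        fused = torch.cat([face_feats, speech_feats], dim=-1)
        _, (h_n, _) = self.lstm(fused)
        # classify from the final hidden state of the last LSTM layer
        return self.classifier(h_n[-1])

# Dummy aligned features standing in for the real extractors.
B, T = 2, 16
model = MultimodalEmotionLSTM()
logits = model(torch.randn(B, T, 512), torch.randn(B, T, 768))
print(logits.shape)  # torch.Size([2, 3])
```

The sketch uses late fusion by concatenation before the temporal model; the paper's actual fusion scheme may differ.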

Related Material


[pdf]
[bibtex]
@InProceedings{Guerdelli_2025_CVPR,
    author    = {Guerdelli, Hajer and Ferrari, Claudio and Berretti, Stefano and Del Bimbo, Alberto},
    title     = {Multimodal Emotion Prediction in Interpersonal Videos Integrating Facial and Speech Cues},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {5727-5736}
}