Uncertain Multimodal Intention and Emotion Understanding in the Wild

Qu Yang, Qinghongya Shi, Tongxin Wang, Mang Ye; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 24700-24709

Abstract


Understanding intention and emotion from social media poses unique challenges due to the inherent uncertainty in multimodal data, where posts often contain incomplete or missing modalities. While this uncertainty reflects real-world scenarios, it remains underexplored within the computer vision community, particularly in conjunction with the intrinsic relationship between emotion and intention. To address these challenges, we introduce the Multimodal IntentioN and Emotion Understanding in the Wild (MINE) dataset, comprising over 20,000 topic-specific social media posts with natural modality variations across text, image, video, and audio. MINE is distinctively constructed to capture both the uncertain nature of multimodal data and the implicit correlations between intentions and emotions, providing extensive annotations for both aspects. To tackle these scenarios, we propose the Bridging Emotion-Intention via Implicit Label Reasoning (BEAR) framework. BEAR consists of two key components: a BEIFormer that leverages emotion-intention correlations, and a Modality Asynchronous Prompt that handles modality uncertainty. Experiments show that BEAR outperforms existing methods in processing uncertain multimodal data while effectively mining emotion-intention relationships for social media content understanding.
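For readers unfamiliar with prompt-based handling of missing modalities, the sketch below illustrates the general idea in PyTorch: a learnable prompt vector stands in for each absent modality before fusion, and separate heads predict intention and emotion. This is a minimal illustration of the concept only, not the paper's BEAR, BEIFormer, or Modality Asynchronous Prompt implementation; all class names, dimensions, and label counts are assumptions.

# Illustrative sketch only: NOT the paper's BEAR method. Shows one common way to handle
# a missing modality by substituting a learnable prompt vector for the absent feature.
import torch
import torch.nn as nn


class MissingModalityPromptFusion(nn.Module):
    """Fuses text/image/video/audio features; absent modalities are replaced
    by learnable prompt embeddings rather than zeros. (Hypothetical module.)"""

    def __init__(self, dim=256, modalities=("text", "image", "video", "audio"),
                 num_intents=20, num_emotions=7):  # label counts are assumptions
        super().__init__()
        self.modalities = modalities
        # One learnable prompt per modality, used when that modality is missing.
        self.prompts = nn.ParameterDict(
            {m: nn.Parameter(torch.randn(1, dim) * 0.02) for m in modalities}
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)
        # Separate heads for the two correlated tasks (intention and emotion).
        self.intent_head = nn.Linear(dim, num_intents)
        self.emotion_head = nn.Linear(dim, num_emotions)

    def forward(self, feats):
        """feats: dict mapping modality name -> (B, dim) tensor, or None if missing."""
        batch = next(f.size(0) for f in feats.values() if f is not None)
        tokens = []
        for m in self.modalities:
            f = feats.get(m)
            if f is None:  # missing modality: fall back to its learnable prompt
                f = self.prompts[m].expand(batch, -1)
            tokens.append(f.unsqueeze(1))
        fused = self.fusion(torch.cat(tokens, dim=1)).mean(dim=1)  # (B, dim)
        return self.intent_head(fused), self.emotion_head(fused)


if __name__ == "__main__":
    model = MissingModalityPromptFusion()
    feats = {"text": torch.randn(2, 256), "image": torch.randn(2, 256),
             "video": None, "audio": None}  # a post with only text and image
    intent_logits, emotion_logits = model(feats)
    print(intent_logits.shape, emotion_logits.shape)  # (2, 20) and (2, 7)

In practice, the prompt vectors are trained jointly with the rest of the model, so the network learns a useful "placeholder" representation for each modality instead of treating absence as a zero signal.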

Related Material


[bibtex]
@InProceedings{Yang_2025_CVPR,
    author    = {Yang, Qu and Shi, Qinghongya and Wang, Tongxin and Ye, Mang},
    title     = {Uncertain Multimodal Intention and Emotion Understanding in the Wild},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {24700-24709}
}