Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition

Zhou, Qianrui; Xu, Hua; Gu, Yunjin; Wang, Yifan; Li, Songze; Zhang, Hanlei

Qianrui Zhou, Hua Xu, Yunjin Gu, Yifan Wang, Songze Li, Hanlei Zhang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 14979-14989

Abstract

Multimodal intent recognition aims to infer human intents by jointly modeling multiple modalities, and it plays a pivotal role in real-world dialogue systems. However, current methods struggle to model the hierarchical semantics underlying complex intents and lack the capacity for self-evolving reasoning over multimodal representations. To address these issues, we propose HIER, a novel method that integrates hierarchical semantic representation with evolutionary reasoning built on a Multimodal Large Language Model (MLLM). Inspired by human cognition, HIER introduces a structured reasoning paradigm that organizes multimodal semantics into three progressively abstracted levels. It starts with modality-specific tokens that capture localized semantic cues; these tokens are then clustered via a label-guided strategy to form mid-level semantic concepts. To capture higher-order structure, inter-concept relations are selected using Jensen-Shannon (JS) divergence scores, which highlight salient dependencies across concepts. These hierarchical representations are then injected into the MLLM via chain-of-thought (CoT) prompting, enabling step-wise reasoning. In addition, HIER uses a self-evolution mechanism that refines the semantic representations through MLLM feedback, allowing dynamic adaptation during inference. Experiments on three challenging benchmarks show that HIER consistently outperforms state-of-the-art methods and MLLMs, with 1-3% gains across all metrics. Code and more results are available at https://github.com/thuiar/HIER.
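The relation-selection step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each mid-level concept is summarized by a probability distribution over intent labels, that every pair of concepts is scored with the JS divergence, and that the highest-scoring pairs are kept. The concept names, the distributions, and the helper top_k_relations are all hypothetical; the abstract does not say how scores are turned into a selection.

```python
import math
from itertools import combinations


def kl_divergence(p, q):
    """KL(p || q) in nats; entries with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded above by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def top_k_relations(concepts, k):
    """Score every concept pair and keep the k highest-scoring pairs.

    Assumption: a larger JS score marks a more salient relation. The
    abstract does not state the selection direction or threshold.
    """
    scored = [
        ((a, b), js_divergence(concepts[a], concepts[b]))
        for a, b in combinations(sorted(concepts), 2)
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical concepts, each a distribution over three intent labels.
concepts = {
    "gesture": [0.7, 0.2, 0.1],
    "tone": [0.1, 0.2, 0.7],
    "wording": [0.6, 0.3, 0.1],
}

for (a, b), score in top_k_relations(concepts, k=2):
    print(f"{a} <-> {b}: JS = {score:.4f}")
```

Because the JS divergence is symmetric and bounded, scores for different concept pairs are directly comparable, which is one plausible reason to prefer it over plain KL divergence for ranking relations.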

Related Material

@InProceedings{Zhou_2026_CVPR,
    author    = {Zhou, Qianrui and Xu, Hua and Gu, Yunjin and Wang, Yifan and Li, Songze and Zhang, Hanlei},
    title     = {Evolutionary Multimodal Reasoning via Hierarchical Semantic Representation for Intent Recognition},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {14979-14989}
}