Chain-of-Thought Guided Multi-Modal Object Re-Identification

Gao, Ya; Li, Shihao; Liu, Zhaojun; Zheng, Aihua; Li, Chenglong; Tang, Jin

Ya Gao, Shihao Li, Zhaojun Liu, Aihua Zheng, Chenglong Li, Jin Tang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 37705-37714

Abstract

With the rise of vision-language models, multi-modal ReID retrieves specific targets by integrating multiple spectra with textual descriptions. Existing methods rely only on descriptive image-text representation learning and ignore the relationships among the intrinsic logical hierarchies of semantic features. Because Chain-of-Thought (CoT) can provide textual logical context and enhance semantic perception in large-model reasoning, we propose CoT-ReID, a CoT-guided framework that injects the reasoning of Multi-modal Large Language Models (MLLMs) into multi-modal ReID. Specifically, we simulate the joint visual-textual logical decision-making of human reasoning, using CoT textual reasoning to guide visual feature learning at the early, late, and decision-making levels. At the early level, we embed the semantic reversion of CoT hierarchical reasoning into visual features to calibrate bottom-level features and emphasize visual hierarchical reasoning. At the late level, we take the CoT hierarchical reasoning text as an anchor condition to constrain the cross-modal semantic consistency of the visual features. Finally, at the decision-making level, we embed the text attribute features produced by CoT hierarchical reasoning into multi-modal decision-making, providing logical support for selecting discriminative identity features. With the CoT textual benchmarks we construct and the proposed modules, our framework generates more robust multi-modal features in complex scenarios. Comprehensive experiments on four datasets (RGBNT100, MSVR310, WMVeID863, RGBNT201) demonstrate that our method outperforms existing approaches. Code will be released upon acceptance.
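
The abstract does not give the form of the text-anchored consistency constraint. As a rough illustration only, one common way to express such a constraint is to penalize the cosine distance between each modality's visual feature and a shared text embedding. The sketch below assumes that form; the function names, the modality keys (`rgb`, `nir`, `tir`), and the toy vectors are hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_consistency_loss(text_anchor, modality_feats):
    """Mean cosine distance of each modality feature to the text anchor.

    Hypothetical stand-in for a text-anchored consistency term; the
    paper's actual loss is not specified in the abstract.
    """
    dists = [1.0 - cosine(feat, text_anchor) for feat in modality_feats.values()]
    return sum(dists) / len(dists)


# Toy 2-D embeddings (assumed values, for illustration only).
text_anchor = [1.0, 0.0]
feats = {
    "rgb": [1.0, 0.0],  # aligned with the anchor, distance 0
    "nir": [0.0, 1.0],  # orthogonal to the anchor, distance 1
    "tir": [1.0, 1.0],  # 45 degrees from the anchor
}
loss = anchor_consistency_loss(text_anchor, feats)

# When every modality already matches the anchor, the term vanishes.
aligned_loss = anchor_consistency_loss(text_anchor, {"rgb": [2.0, 0.0], "nir": [0.5, 0.0]})
```

Minimizing a term of this kind pulls every modality toward the same text-defined direction, which is one plausible reading of using the reasoning text as an "anchor condition".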

Related Material

@InProceedings{Gao_2026_CVPR,
    author    = {Gao, Ya and Li, Shihao and Liu, Zhaojun and Zheng, Aihua and Li, Chenglong and Tang, Jin},
    title     = {Chain-of-Thought Guided Multi-Modal Object Re-Identification},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {37705-37714}
}