@InProceedings{Yang_2026_CVPR,
    author    = {Yang, Jing and Yang, Sen and Duan, Boqiang and Dai, Ming and Zhang, Wei and Tan, Xiao and Chen, Kunbin and He, Wei and Wang, Jingdong and Wang, Hanli},
    title     = {Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {5175-5186}
}
Hugging Visual Prompt and Segmentation Tokens: Consistency Learning for Fine-Grained Visual Understanding in MLLMs
Abstract
Recently, multimodal large language models (MLLMs) have achieved remarkable success in general multimodal tasks. Increasing attention has been given to leveraging MLLMs for fine-grained visual understanding, such as region-level captioning and pixel-level grounding. However, most existing approaches are task-specific. While some recent unified approaches attempt to handle both task types simultaneously, they still fall short of deeply exploring the underlying associations across tasks. To bridge this gap, we propose FCLM, a large multimodal model designed to jointly support fine-grained visual understanding through consistency learning. The central premise is that pixel-level captioning and grounding are mutually beneficial and complementary, each enhancing the other in achieving a fine-grained understanding of visual content. Specifically, FCLM analyzes the representation features (visual prompt and segmentation tokens) required for the two types of visual tasks, and achieves advanced reasoning and perception through a newly designed consistency learning loss and a two-stage training framework. Moreover, a hybrid region extractor is designed to enhance visual prompt embeddings, yielding semantically discriminative representations for detailed captioning. Furthermore, a novel detailed localized referring expression segmentation (DL-RES) task is introduced to evaluate the model's ability to localize targets from detailed textual descriptions. Extensive experiments on seven visual understanding tasks demonstrate the superior performance and strong generalization of FCLM.