ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large Language Models

Yin, Heng; Ren, Yuqiang; Yan, Ke; Ding, Shouhong; Hao, Yongtao

Heng Yin, Yuqiang Ren, Ke Yan, Shouhong Ding, Yongtao Hao; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 14358-14368

Abstract

Multimodal large language models (MLLMs) have demonstrated strong language understanding and generation capabilities and excel at visual tasks such as referring and grounding. However, due to task type limitations and dataset scarcity, existing MLLMs can only ground objects that are present in an image and cannot effectively reject non-existent objects, resulting in unreliable predictions. In this paper, we introduce ROD-MLLM, a novel MLLM for Reliable Object Detection using free-form language. We propose a query-based localization mechanism to extract low-level object features. By aligning global and object-level visual information with the text space, we leverage the large language model (LLM) for high-level comprehension and final localization decisions, overcoming the language understanding limitations of conventional detectors. To enhance language-based object detection, we design an automated data annotation pipeline and construct the ROD dataset. This pipeline uses the referring capabilities of existing MLLMs and chain-of-thought techniques to generate diverse expressions corresponding to zero or multiple objects, addressing the shortage of training data. Experiments across various tasks, including referring, grounding, and language-based object detection, show that ROD-MLLM achieves state-of-the-art performance among MLLMs. Notably, in language-based object detection, our model achieves a +13.7 AP improvement on the D3 benchmark over existing MLLMs and surpasses most specialized detection models, especially in scenarios requiring complex language understanding.
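To make the task setting concrete, the following is a minimal sketch of what "zero or multiple objects" means for the output of a language-based detector. It is not the ROD-MLLM implementation: the names `Candidate` and `select_matches`, the `match_score` field, and the fixed threshold are all hypothetical, introduced only for illustration. In the abstract's description the final localization decision is made by the LLM itself, not by thresholding a score.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A box is (x1, y1, x2, y2) in pixel coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Candidate:
    """A candidate object for one expression (hypothetical structure)."""
    box: Box
    # Assumed score in [0, 1] for how well the object fits the expression.
    match_score: float


def select_matches(candidates: List[Candidate], threshold: float = 0.5) -> List[Box]:
    """Return every box judged to match the expression.

    The result may contain zero, one, or several boxes. An empty list
    means the described object is rejected as absent from the image.
    """
    return [c.box for c in candidates if c.match_score >= threshold]


if __name__ == "__main__":
    candidates = [
        Candidate((0, 0, 10, 10), 0.9),
        Candidate((5, 5, 20, 20), 0.2),
        Candidate((30, 30, 40, 40), 0.7),
    ]
    print(select_matches(candidates))                       # two matching boxes
    print(select_matches([Candidate((0, 0, 1, 1), 0.1)]))   # no match: empty list
```

The key contrast with a grounding-only interface is the return type: a grounding model is expected to produce a box for every expression, whereas here an empty list is a valid answer.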

Related Material

[bibtex]

@InProceedings{Yin_2025_CVPR,
    author    = {Yin, Heng and Ren, Yuqiang and Yan, Ke and Ding, Shouhong and Hao, Yongtao},
    title     = {ROD-MLLM: Towards More Reliable Object Detection in Multimodal Large Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {14358-14368}
}