MonoVLM: Monocular 3D Visual Grounding with Vision Language Models

Qu, Huaizhi; Mahjoub, Hossein Nourkhiz; Tadiparthi, Vaishnav; Lee, Kwonjoon; Chen, Tianlong

Huaizhi Qu, Hossein Nourkhiz Mahjoub, Vaishnav Tadiparthi, Kwonjoon Lee, Tianlong Chen; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 30986-30996

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable capabilities in instruction following and 2D visual understanding. However, state-of-the-art VLMs, including GPT-5, still struggle with 3D perception, particularly in tasks such as monocular 3D visual grounding. While specialized vision-only models excel in this domain, they often lack the rich semantic understanding inherent to VLMs. To bridge this gap, we propose MonoVLM, a novel triple-stage training framework that enables VLMs to perform accurate monocular 3D grounding. The core of our method is a progressive training process that uses Group Relative Policy Optimization (GRPO) to gradually teach the model to first localize the described object, then understand its 3D structure, and finally estimate the full 3D bounding box accurately. Comprehensive experiments show that MonoVLM models significantly outperform existing VLMs and even surpass the performance of specialized vision-only models. We validate our design via extensive comparisons and ablation studies.
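For readers unfamiliar with GRPO, the following is a minimal sketch of its group-relative advantage step only: several responses are sampled for the same prompt, each is scored, and every reward is normalized against the mean and standard deviation of its own group. The function name, the `eps` constant, the use of the population standard deviation, and the reward values are illustrative assumptions; the abstract does not specify the reward functions used in the three stages, so none are shown here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other rewards in its group.

    `rewards` holds one scalar score per response sampled for a single
    prompt. `eps` (an assumed constant) avoids division by zero when all
    responses in the group receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four responses sampled for one prompt.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and are reinforced, while those below the mean receive a negative one; no separate value network is needed to provide the baseline.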

Related Material

[bibtex]

@InProceedings{Qu_2026_CVPR,
    author    = {Qu, Huaizhi and Mahjoub, Hossein Nourkhiz and Tadiparthi, Vaishnav and Lee, Kwonjoon and Chen, Tianlong},
    title     = {MonoVLM: Monocular 3D Visual Grounding with Vision Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {30986-30996}
}