Beyond Human Perception: Understanding Multi-Object World from Monocular View
Abstract
Language and binocular vision play a crucial role in human understanding of the world. Advances in artificial intelligence have likewise made it possible for machines to develop the 3D perception capabilities essential for high-level scene understanding. In practice, however, often only monocular cameras are available due to cost and space constraints. Enabling machines to achieve accurate 3D understanding from a monocular view is therefore of great practical value, but it presents significant challenges. We introduce MonoMulti-3DVG, a novel task aimed at achieving multi-object 3D Visual Grounding (3DVG) from monocular RGB images, allowing machines to better understand and interact with the 3D world. To this end, we construct a large-scale benchmark dataset, MonoMulti3D-ROPE, and propose a model, CyclopsNet, which integrates a State-Prompt Visual Encoder (SPVE) module with a Denoising Alignment Fusion (DAF) module to achieve robust multi-modal semantic alignment and fusion, yielding more stable and robust multi-modal joint representations for downstream tasks. Experimental results show that our method significantly outperforms existing techniques on the MonoMulti3D-ROPE dataset. Our dataset and code are available at https://github.com/JasonHuang516/MonoMulti-3DVG.
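The abstract names the SPVE and DAF modules but does not spell out their internals. As a rough illustration only, the minimal PyTorch sketch below shows one plausible way a prompt-conditioned visual encoder, a cross-modal alignment/fusion block, and a 3D box head could be wired together for monocular multi-object grounding. All class names, layer choices, and tensor shapes here are hypothetical assumptions and do not reflect CyclopsNet's actual implementation; consult the repository linked above for the authors' code.

import torch
import torch.nn as nn

class StatePromptVisualEncoder(nn.Module):
    """Illustrative stand-in for an SPVE-style module: a small CNN backbone
    whose tokens attend to learnable 'state prompt' embeddings.
    (Hypothetical design, not the paper's architecture.)"""
    def __init__(self, dim=256, num_prompts=8):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, images):                      # images: (B, 3, H, W)
        feats = self.backbone(images)               # (B, C, H', W')
        B, C, H, W = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)   # (B, H'*W', C)
        prompts = self.prompts.unsqueeze(0).expand(B, -1, -1)
        fused, _ = self.attn(tokens, prompts, prompts)  # tokens attend to prompts
        return tokens + fused                       # prompt-conditioned visual tokens

class DenoisingAlignmentFusion(nn.Module):
    """Illustrative stand-in for a DAF-style module: text-to-vision
    cross-attention followed by a lightweight residual 'denoising' MLP."""
    def __init__(self, dim=256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.denoise = nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, text_tokens, visual_tokens):
        aligned, _ = self.cross_attn(text_tokens, visual_tokens, visual_tokens)
        return aligned + self.denoise(aligned)      # joint multi-modal representation

class MonoMulti3DVGHead(nn.Module):
    """Toy grounding head: per-query 3D box (center, size, yaw = 7 values)
    plus an objectness score."""
    def __init__(self, dim=256):
        super().__init__()
        self.box_head = nn.Linear(dim, 7)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, joint_tokens):
        return self.box_head(joint_tokens), self.score_head(joint_tokens)

if __name__ == "__main__":
    images = torch.randn(2, 3, 256, 256)            # monocular RGB batch
    text_tokens = torch.randn(2, 20, 256)           # pre-encoded language query
    spve, daf, head = StatePromptVisualEncoder(), DenoisingAlignmentFusion(), MonoMulti3DVGHead()
    joint = daf(text_tokens, spve(images))
    boxes, scores = head(joint)
    print(boxes.shape, scores.shape)                # (2, 20, 7) and (2, 20, 1)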
Related Material
[pdf]
[bibtex]
@InProceedings{Guo_2025_CVPR,
  author    = {Guo, Keyu and Huang, Yongle and Sun, Shijie and Song, Xiangyu and Feng, Mingtao and Liu, Zedong and Song, Huansheng and Wang, Tiantian and Li, Jianxin and Akhtar, Naveed and Mian, Ajmal Saeed},
  title     = {Beyond Human Perception: Understanding Multi-Object World from Monocular View},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {3751-3760}
}