Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding

Sai Wang, Yutian Lin, Yu Wu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14261-14270

Abstract


Unsupervised visual grounding methods alleviate the expensive manual annotation of image-query pairs by generating pseudo-queries. However, existing methods are prone to confusing the spatial relationships between objects and rely on complex, manually designed prompt modules to generate query texts, which severely impedes their ability to generate accurate and comprehensive queries due to ambiguous spatial relationships and fixed, hand-crafted templates. To tackle these challenges, we propose an omni-directional language query generation approach for unsupervised visual grounding, named Omni-Q. Specifically, we develop a 3D spatial relation module that extends the 2D spatial representation to 3D, thereby utilizing 3D location information to accurately determine the spatial positions among objects. In addition, we introduce a spatial graph module that leverages the power of graph structures to establish accurate and diverse object relationships, thus enhancing the flexibility of query generation. Extensive experiments on five public benchmark datasets demonstrate that our method significantly outperforms existing state-of-the-art unsupervised methods by up to 16.17%. Moreover, when applied in the supervised setting, our method can save up to 60% of human annotations without a loss of performance.
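
To make the two components described above more concrete, the following is a minimal, hypothetical sketch of how 3D spatial relations and a spatial relation graph could drive pseudo-query generation. The names Object3D, infer_3d_relation, and SpatialGraph, as well as the depth-based relation rules and the query templates, are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Hypothetical sketch of omni-directional pseudo-query generation.
# Object3D, infer_3d_relation, and SpatialGraph are illustrative names,
# not the Omni-Q paper's actual code.
from dataclasses import dataclass
from itertools import permutations


@dataclass
class Object3D:
    name: str      # detected category label, e.g. "cup"
    x: float       # horizontal center in image coordinates (increasing rightward)
    y: float       # vertical center in image coordinates (increasing downward)
    depth: float   # estimated distance from the camera (e.g. from a depth estimator)


def infer_3d_relation(a: Object3D, b: Object3D) -> str:
    """Describe object a relative to object b using 3D displacement."""
    dx, dy, dz = b.x - a.x, b.y - a.y, b.depth - a.depth
    # Use the axis with the largest displacement, so relations that are
    # ambiguous in 2D (e.g. "behind" vs. "above") are resolved with depth.
    if abs(dz) >= max(abs(dx), abs(dy)):
        return "in front of" if dz > 0 else "behind"
    if abs(dx) >= abs(dy):
        return "to the left of" if dx > 0 else "to the right of"
    return "above" if dy > 0 else "below"


class SpatialGraph:
    """Directed graph over detected objects; edges carry spatial relations."""

    def __init__(self, objects: list[Object3D]):
        self.objects = objects
        self.edges = {
            (a.name, b.name): infer_3d_relation(a, b)
            for a, b in permutations(objects, 2)
        }

    def generate_queries(self) -> list[str]:
        # Each edge yields one pseudo-query; richer templates or multi-hop
        # graph walks could produce longer, compositional descriptions.
        return [f"the {a} {rel} the {b}" for (a, b), rel in self.edges.items()]


if __name__ == "__main__":
    scene = [Object3D("cup", 0.3, 0.5, 1.2), Object3D("laptop", 0.6, 0.5, 2.0)]
    for query in SpatialGraph(scene).generate_queries():
        print(query)  # e.g. "the cup in front of the laptop"
```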

Related Material


[bibtex]
@InProceedings{Wang_2024_CVPR,
    author    = {Wang, Sai and Lin, Yutian and Wu, Yu},
    title     = {Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {14261-14270}
}