GroundingMate: Aiding Object Grounding for Goal-Oriented Vision-and-Language Navigation
Abstract
Goal-Oriented Vision-and-Language Navigation (VLN) aims to enable agents to navigate to specified locations and identify designated target objects by following natural language instructions. This setting has gained popularity due to its close alignment with real-world scenarios. However, existing studies have predominantly focused on enhancing navigation performance while neglecting the ability to locate objects at the navigation endpoint. This oversight has resulted in a significant discrepancy between the success rates of navigation and object grounding. The challenge is compounded by the complex reasoning required by the instructions and the necessity to synthesize multi-perspective images of objects, which overwhelms traditional object grounding methods. We leverage a Multi-Modal Large Language Model (MLLM) to bridge this gap, allowing agents to seek assistance from such models when struggling to locate the target object. The agent conducts a multi-stage evaluation to discern the cause of its confusion and promptly extracts and updates the most relevant information for the MLLM to assess. Our method is plug-and-play and model-agnostic, facilitating integration with numerous existing VLN strategies without the need for retraining. Implementing our approach across four distinct methods improves performance on the REVERIE and SOON datasets, demonstrating the effectiveness and generalizability of our technique.
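To make the described assistance loop concrete, the sketch below illustrates one plausible reading of the abstract: the agent first grounds the target with its own model, and only when its confidence is low does it extract the most relevant candidate views and defer to an MLLM. This is a minimal, hypothetical sketch, not the paper's actual implementation; all names (GroundingResult, base_grounder, query_mllm, the confidence threshold) are illustrative assumptions.

# Hypothetical sketch of an MLLM-assisted object grounding loop, as
# suggested by the abstract. Interfaces and names are assumptions.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class GroundingResult:
    object_id: Optional[int]   # index of the selected candidate object
    confidence: float          # agent's own grounding confidence in [0, 1]

def ground_with_assistance(
    instruction: str,
    candidate_views: List[str],          # multi-perspective view identifiers
    base_grounder: Callable[[str, List[str]], GroundingResult],
    query_mllm: Callable[[str], str],    # stand-in for any MLLM API call
    threshold: float = 0.5,              # assumed confidence cutoff
) -> int:
    """Return the index of the chosen object, asking the MLLM when unsure."""
    result = base_grounder(instruction, candidate_views)
    if result.object_id is not None and result.confidence >= threshold:
        return result.object_id  # agent is confident; no assistance needed

    # Multi-stage evaluation (simplified): keep only the most relevant
    # context, then hand the pared-down query to the MLLM to assess.
    relevant = candidate_views[:5]  # e.g. retain only the top-scoring views
    prompt = (
        f"Instruction: {instruction}\n"
        f"Candidate object views: {relevant}\n"
        "Which candidate index best matches the described target object?"
    )
    answer = query_mllm(prompt)

    # Fall back to the agent's own guess if the MLLM reply is unusable.
    try:
        return int(answer.strip())
    except ValueError:
        return result.object_id or 0

Because the loop only wraps an existing grounder behind a confidence check, it is plug-and-play in the sense the abstract describes: any of the four evaluated VLN methods could serve as base_grounder without retraining.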
Related Material
[pdf] [supp] [bibtex]
@InProceedings{Liu_2025_WACV,
    author    = {Liu, Qianyi and Zhang, Siqi and Qiao, Yanyuan and Zhu, Junyou and Li, Xiang and Guo, Longteng and Wang, Qunbo and He, Xingjian and Wu, Qi and Liu, Jing},
    title     = {GroundingMate: Aiding Object Grounding for Goal-Oriented Vision-and-Language Navigation},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {1775-1784}
}