DiaLoc: An Iterative Approach to Embodied Dialog Localization

Zhang, Chao; Li, Mohan; Budvytis, Ignas; Liwicki, Stephan

Chao Zhang, Mohan Li, Ignas Budvytis, Stephan Liwicki; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 12585-12593

Abstract

Multimodal learning has advanced the performance of many vision-language tasks. However, most existing works in embodied dialog research focus on navigation and leave the localization task understudied. The few existing dialog-based localization approaches assume the availability of the entire dialog prior to localization, which is impractical for deployed dialog-based localization. In this paper, we propose DiaLoc, a new dialog-based localization framework which aligns with real human operator behavior. Specifically, we produce an iterative refinement of location predictions, which can visualize current pose beliefs after each dialog turn. DiaLoc effectively utilizes the multimodal data for multi-shot localization, where a fusion encoder fuses vision and dialog information iteratively. We achieve state-of-the-art results on the embodied dialog-based localization task in single-shot (+7.08% in Acc5@valUnseen) and multi-shot (+10.85% in Acc5@valUnseen) settings. DiaLoc narrows the gap between simulation and real-world applications, opening doors for future research on collaborative localization and navigation.

Related Material

@InProceedings{Zhang_2024_CVPR,
    author    = {Zhang, Chao and Li, Mohan and Budvytis, Ignas and Liwicki, Stephan},
    title     = {DiaLoc: An Iterative Approach to Embodied Dialog Localization},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {12585-12593}
}