Less Is More: Generating Grounded Navigation Instructions From Landmarks

Wang, Su; Montgomery, Ceslee; Orbay, Jordi; Birodkar, Vighnesh; Faust, Aleksandra; Gur, Izzeddin; Jaques, Natasha; Waters, Austin; Baldridge, Jason; Anderson, Peter

Su Wang, Ceslee Montgomery, Jordi Orbay, Vighnesh Birodkar, Aleksandra Faust, Izzeddin Gur, Natasha Jaques, Austin Waters, Jason Baldridge, Peter Anderson; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15428-15438

Abstract

We study the automatic generation of navigation instructions from 360-degree images captured on indoor routes. Existing generators suffer from poor visual grounding, causing them to rely on language priors and hallucinate objects. Our MARKY-MT5 system addresses this by focusing on visual landmarks; it comprises a first-stage landmark detector and a second-stage generator (a multimodal, multilingual, multitask encoder-decoder). To train it, we bootstrap grounded landmark annotations on top of the Room-across-Room (RxR) dataset. Using text parsers, weak supervision from RxR's pose traces, and a multilingual image-text encoder trained on 1.8B images, we identify 1.1M English, Hindi and Telugu landmark descriptions and ground them to specific regions in panoramas. On Room-to-Room, human wayfinders obtain success rates (SR) of 73% following MARKY-MT5's instructions, just shy of their 76% SR following human instructions, and well above SRs with other generators. Evaluations on RxR's longer, diverse paths obtain SRs of 62-64% across three languages. Generating such high-quality navigation instructions in novel environments is a step towards conversational navigation tools and could facilitate larger-scale training of instruction-following agents.

Related Material

@InProceedings{Wang_2022_CVPR,
    author    = {Wang, Su and Montgomery, Ceslee and Orbay, Jordi and Birodkar, Vighnesh and Faust, Aleksandra and Gur, Izzeddin and Jaques, Natasha and Waters, Austin and Baldridge, Jason and Anderson, Peter},
    title     = {Less Is More: Generating Grounded Navigation Instructions From Landmarks},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {15428-15438}
}