Topological Planning With Transformers for Vision-and-Language Navigation

Kevin Chen, Junshen K. Chen, Jo Chuang, Marynel Vazquez, Silvio Savarese; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 11276-11286

Abstract


Conventional approaches to vision-and-language navigation (VLN) are trained end-to-end but struggle to perform well in freely traversable environments. Inspired by the robotics community, we propose a modular approach to VLN using topological maps. Given a natural language instruction and topological map, our approach leverages attention mechanisms to predict a navigation plan in the map. The plan is then executed with low-level actions (e.g. forward, rotate) using a robust controller. Experiments show that our method outperforms previous end-to-end approaches, generates interpretable navigation plans, and exhibits intelligent behaviors such as backtracking.
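The split described in the abstract, between a plan over map nodes and a controller that emits low-level actions, can be illustrated with a small sketch. This is not the authors' implementation: the attention-based planner is omitted entirely and the plan is supplied by hand. The node names, coordinates, and the helpers `is_valid_plan` and `plan_to_actions` are hypothetical, and the geometric conversion to rotate/forward commands is an assumed stand-in for the controller, which the abstract does not specify.

```python
import math

# Hypothetical topological map: node id -> (x, y) position in metres,
# plus an undirected adjacency list. All values are made up for illustration.
NODES = {"hall": (0.0, 0.0), "door": (2.0, 0.0), "kitchen": (2.0, 3.0)}
EDGES = {"hall": {"door"}, "door": {"hall", "kitchen"}, "kitchen": {"door"}}


def is_valid_plan(plan, edges):
    """A plan is valid if every consecutive pair of nodes shares an edge."""
    return all(b in edges[a] for a, b in zip(plan, plan[1:]))


def plan_to_actions(plan, nodes, heading_deg=0.0):
    """Expand a node sequence into (action, amount) pairs.

    Rotations are in degrees (counter-clockwise positive, wrapped to
    (-180, 180]); forward moves are in metres. This is an assumed,
    purely geometric controller, not the one used in the paper.
    """
    actions = []
    for a, b in zip(plan, plan[1:]):
        (ax, ay), (bx, by) = nodes[a], nodes[b]
        target = math.degrees(math.atan2(by - ay, bx - ax))
        turn = (target - heading_deg + 180.0) % 360.0 - 180.0
        if turn == -180.0:
            turn = 180.0  # prefer +180 over -180 for a full reversal
        if abs(turn) > 1e-9:
            actions.append(("rotate", turn))
        actions.append(("forward", math.hypot(bx - ax, by - ay)))
        heading_deg = target
    return actions


if __name__ == "__main__":
    plan = ["hall", "door", "kitchen"]  # hand-written plan, not a model output
    if is_valid_plan(plan, EDGES):
        for action in plan_to_actions(plan, NODES):
            print(action)
```

Because the plan is just a node sequence, backtracking needs no special handling in this sketch: a plan such as `["hall", "door", "hall"]` revisits a node and expands into a reversal followed by a forward move.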

Related Material


[bibtex]
@InProceedings{Chen_2021_CVPR,
    author    = {Chen, Kevin and Chen, Junshen K. and Chuang, Jo and Vazquez, Marynel and Savarese, Silvio},
    title     = {Topological Planning With Transformers for Vision-and-Language Navigation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {11276-11286}
}