OctoNav: Towards Generalist Embodied Navigation

Gao, Chen; Jin, Liankai; Peng, Xingyu; Zhang, Jiazhao; Deng, Yue; Li, Annan; Wang, He; Liu, Si

Chen Gao, Liankai Jin, Xingyu Peng, Jiazhao Zhang, Yue Deng, Annan Li, He Wang, Si Liu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 40074-40084

Abstract

Embodied navigation is a foundational pillar in the pursuit of embodied intelligence. However, previous navigation research is divided into separate tasks/capabilities, e.g., ObjNav, ImgNav and VLN, which differ in task settings, objectives and modalities, so datasets and methods have been designed individually for each. In this work, we take steps toward generalist navigation agents, which can follow free-form instructions that include arbitrary compounds of modality and capability. To achieve this, we propose a large-scale benchmark and a corresponding method, termed OctoNav-Bench and OctoNav-R1. Specifically, OctoNav-Bench is constructed via a designed automatic annotation pipeline. We carefully craft instruction-trajectory pairs, where the instructions are diverse and free-form, with arbitrary modality and capability. We also construct a Think-Before-Action (TBA-CoT) dataset within OctoNav-Bench to provide the thinking process behind actions. OctoNav-R1 is built upon MLLMs and adapted into a VLA-type model, which can produce low-level actions based solely on 2D visual observations. Moreover, we design a Hybrid Training Paradigm (HTP) that consists of three stages, i.e., Action-/TBA-SFT, Nav-GRPO, and Online RL. Each stage contains specifically designed learning policies and rewards. Inspired by OpenAI-o1 and DeepSeek-R1, which show impressive reasoning ability via thinking-before-answer, we design TBA-SFT and Nav-GRPO to achieve thinking-before-action for embodied navigation, improving the model's reasoning ability toward generalist navigation. TBA-SFT uses the TBA-CoT dataset to fine-tune the model, and we then leverage Nav-GRPO to improve its thinking ability. Finally, OctoNav-R1 shows superior performance compared with previous methods.
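For readers unfamiliar with GRPO-style training, the following is a minimal sketch of the group-relative advantage computation that such methods generally rely on. It is not the paper's Nav-GRPO implementation: the abstract does not specify the rewards, so the reward values, the function name `group_relative_advantages`, and the use of the population standard deviation are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its own group.

    Hypothetical helper: responses sampled for the same instruction form
    one group, and each advantage is the reward's deviation from the group
    mean, scaled by the group standard deviation (population std assumed).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled think-then-act responses
# to a single navigation instruction.
rewards = [1.0, 0.0, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one, so no separate learned value model is needed to provide a baseline.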

Related Material


@InProceedings{Gao_2026_CVPR,
    author    = {Gao, Chen and Jin, Liankai and Peng, Xingyu and Zhang, Jiazhao and Deng, Yue and Li, Annan and Wang, He and Liu, Si},
    title     = {OctoNav: Towards Generalist Embodied Navigation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {40074-40084}
}