@InProceedings{Yang_2024_CVPR,
    author    = {Yang, Zeyuan and Liu, Jiageng and Chen, Peihao and Cherian, Anoop and Marks, Tim K. and Le Roux, Jonathan and Gan, Chuang},
    title     = {RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {16251-16261}
}
RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation
Abstract
We leverage Large Language Models (LLMs) for zero-shot Semantic Audio-Visual Navigation (SAVN). Existing methods use extensive training demonstrations for reinforcement learning, yet achieve relatively low success rates and lack generalizability. The intermittent nature of auditory signals poses additional obstacles to inferring the goal information. To address this challenge, we present the Reflective and Imaginative Language Agent (RILA). By employing multi-modal models to process sensory data, we instruct an LLM-based planner to actively explore the environment. During exploration, our agent adaptively evaluates and dismisses inaccurate perceptual descriptions. Additionally, we introduce an auxiliary LLM-based assistant to enhance global environmental comprehension by mapping room layouts and providing strategic insights. Through comprehensive experiments and analysis, we show that our method outperforms relevant baselines without requiring training demonstrations from the environment or complementary semantic information.