OpenEQA: Embodied Question Answering in the Era of Foundation Models

Arjun Majumdar, Anurag Ajay, Xiaohan Zhang, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Alexander Sax, Aravind Rajeswaran; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 16488-16498

Abstract


We present a modern formulation of Embodied Question Answering (EQA) as the task of understanding an environment well enough to answer questions about it in natural language. An agent can achieve such an understanding either by drawing on episodic memory, as exemplified by agents on smart glasses, or by actively exploring the environment, as in the case of mobile robots. We accompany our formulation with OpenEQA -- the first open-vocabulary benchmark dataset for EQA that supports both the episodic-memory and active-exploration use cases. OpenEQA contains over 1,600 high-quality, human-generated questions drawn from over 180 real-world environments. In addition to the dataset, we provide an automatic LLM-powered evaluation protocol that shows excellent correlation with human judgement. Using this dataset and evaluation protocol, we evaluate several state-of-the-art foundation models, including GPT-4V, and find that they significantly lag behind human-level performance. Consequently, OpenEQA stands out as a straightforward, measurable, and practically relevant benchmark that poses a considerable challenge to the current generation of foundation models. We hope this inspires and stimulates future research at the intersection of Embodied AI, conversational agents, and world models.
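To make the LLM-powered evaluation protocol concrete, the sketch below shows one way an LLM-as-judge scorer for open-vocabulary answers could be implemented. The judge prompt wording, the gpt-4o model choice, and the llm_match/aggregate helper names are illustrative assumptions rather than the paper's released implementation; it follows the general shape of the paper's LLM-Match metric, which elicits a 1-5 score per answer and aggregates scores into a 0-100 correctness value.

```python
# Minimal sketch of an LLM-as-judge scorer in the spirit of OpenEQA's
# LLM-Match protocol. Prompt, model name, and helpers are assumptions;
# consult the paper's released code for the official prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

JUDGE_PROMPT = (
    "You are grading answers to questions about a real-world environment.\n"
    "Question: {question}\n"
    "Ground-truth answer: {reference}\n"
    "Model answer: {prediction}\n"
    "Rate the model answer from 1 (wrong) to 5 (matches the ground truth). "
    "Reply with only the integer score."
)

def llm_match(question: str, reference: str, prediction: str) -> int:
    """Ask an LLM judge for a 1-5 score and return the parsed integer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model; any strong LLM could be used
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, reference=reference, prediction=prediction)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())

def aggregate(scores: list[int]) -> float:
    """Map mean 1-5 scores to a 0-100 correctness value, as in the paper."""
    return 100.0 * sum((s - 1) / 4 for s in scores) / len(scores)
```

A benchmark run would then loop over (question, reference, prediction) triples, collect the per-answer scores, and report the aggregated correctness; pinning temperature to 0 keeps the judge's scores reproducible across runs.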

Related Material


@InProceedings{Majumdar_2024_CVPR,
  author    = {Majumdar, Arjun and Ajay, Anurag and Zhang, Xiaohan and Putta, Pranav and Yenamandra, Sriram and Henaff, Mikael and Silwal, Sneha and Mcvay, Paul and Maksymets, Oleksandr and Arnaud, Sergio and Yadav, Karmesh and Li, Qiyang and Newman, Ben and Sharma, Mohit and Berges, Vincent and Zhang, Shiqi and Agrawal, Pulkit and Bisk, Yonatan and Batra, Dhruv and Kalakrishnan, Mrinal and Meier, Franziska and Paxton, Chris and Sax, Alexander and Rajeswaran, Aravind},
  title     = {OpenEQA: Embodied Question Answering in the Era of Foundation Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {16488-16498}
}