ARNOLD: A Benchmark for Language-Grounded Task Learning with Continuous States in Realistic 3D Scenes

Gong, Ran; Huang, Jiangyong; Zhao, Yizhou; Geng, Haoran; Gao, Xiaofeng; Wu, Qingyang; Ai, Wensi; Zhou, Ziheng; Terzopoulos, Demetri; Zhu, Song-Chun; Jia, Baoxiong; Huang, Siyuan

Ran Gong, Jiangyong Huang, Yizhou Zhao, Haoran Geng, Xiaofeng Gao, Qingyang Wu, Wensi Ai, Ziheng Zhou, Demetri Terzopoulos, Song-Chun Zhu, Baoxiong Jia, Siyuan Huang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 20483-20495

Abstract

Understanding the continuous states of objects is essential for task learning and planning in the real world. However, most existing task learning benchmarks assume discrete (e.g., binary) object states, which poses challenges for learning complex tasks and transferring learned policy from the simulated environment to the real world. Furthermore, the robot's ability to follow human instructions based on grounding the actions and states is limited. To tackle these challenges, we present ARNOLD, a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes. ARNOLD consists of 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals. To promote language-instructed learning, we provide expert demonstrations with template-generated language descriptions. We assess task performance by utilizing the latest language-conditioned policy learning models. Our results indicate that current models for language-conditioned manipulations continue to experience significant challenges when it comes to novel goal-state generalizations, scene generalizations, and object generalizations. These findings highlight the need to develop new algorithms to address this gap and underscore the potential for further research in this area.

Related Material

[pdf] [supp] [arXiv]

[bibtex]

@InProceedings{Gong_2023_ICCV, author = {Gong, Ran and Huang, Jiangyong and Zhao, Yizhou and Geng, Haoran and Gao, Xiaofeng and Wu, Qingyang and Ai, Wensi and Zhou, Ziheng and Terzopoulos, Demetri and Zhu, Song-Chun and Jia, Baoxiong and Huang, Siyuan}, title = {ARNOLD: A Benchmark for Language-Grounded Task Learning with Continuous States in Realistic 3D Scenes}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2023}, pages = {20483-20495} }