ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis

Yun Chang, Leonor Fermoselle, Duy Ta, Bernadette Bucher, Luca Carlone, Jiuguang Wang; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 29458-29468

Abstract

While recent work in scene reconstruction and understanding has made strides in grounding natural language to physical 3D environments, it is still challenging to ground abstract, high-level instructions to a 3D scene. High-level instructions might not explicitly invoke semantic elements in the scene, and even the process of breaking a high-level task into a set of more concrete subtasks (a process called hierarchical task analysis) is environment-dependent. In this work, we propose ASHiTA, the first framework that generates a task hierarchy grounded to a 3D scene graph by breaking down high-level tasks into grounded subtasks. ASHiTA alternates LLM-assisted hierarchical task analysis, which generates the task breakdown, with task-driven scene graph construction, which generates a suitable representation of the environment. Our experiments show that ASHiTA performs significantly better than LLM baselines in breaking down high-level tasks into environment-dependent subtasks, and additionally achieves grounding performance comparable to state-of-the-art methods.
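The abstract describes an alternation between two steps: LLM-assisted task decomposition conditioned on the current scene representation, and task-driven scene graph construction conditioned on the current subtasks. A minimal, runnable sketch of such a loop is below; it is an illustration of the alternation pattern only, not the authors' implementation, and every function name and data structure in it is a hypothetical placeholder.

```python
"""Hypothetical sketch of an alternating task-analysis / scene-graph loop,
based only on the paper's abstract. All names here (decompose_task,
build_scene_graph, ashita_sketch) are illustrative placeholders and do not
reflect the authors' actual interfaces or algorithm details."""

def decompose_task(task: str, scene_graph: dict) -> list[str]:
    # Placeholder for LLM-assisted hierarchical task analysis: an LLM would
    # break `task` into subtasks, conditioned on the current scene graph.
    return [f"{task}: handle {obj}" for obj in scene_graph]

def build_scene_graph(observations: list[str], subtasks: list[str]) -> dict:
    # Placeholder for task-driven scene graph construction: retain the
    # scene elements the current subtasks refer to (all of them initially).
    if not subtasks:
        return {obj: {} for obj in observations}
    return {obj: {} for obj in observations
            if any(obj in s for s in subtasks)}

def ashita_sketch(task: str, observations: list[str], max_iters: int = 5):
    """Alternate task breakdown and scene graph refinement until stable."""
    subtasks: list[str] = []
    scene_graph = build_scene_graph(observations, subtasks)
    for _ in range(max_iters):
        new_subtasks = decompose_task(task, scene_graph)
        if new_subtasks == subtasks:  # fixed point: hierarchy is grounded
            break
        subtasks = new_subtasks
        scene_graph = build_scene_graph(observations, subtasks)
    return subtasks, scene_graph

if __name__ == "__main__":
    subtasks, graph = ashita_sketch("tidy the kitchen",
                                    ["sink", "counter", "dish rack"])
    print(subtasks)
```

The fixed-point check stands in for whatever convergence criterion the full method uses; the point of the sketch is only that the task hierarchy and the scene representation are refined jointly rather than in a single pass.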

Related Material

[pdf] [supp] [arXiv]
BibTeX:
@InProceedings{Chang_2025_CVPR,
    author    = {Chang, Yun and Fermoselle, Leonor and Ta, Duy and Bucher, Bernadette and Carlone, Luca and Wang, Jiuguang},
    title     = {ASHiTA: Automatic Scene-grounded HIerarchical Task Analysis},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {29458-29468}
}