HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding

Jiahe Zhao, Ruibing Hou, Zejie Tian, Hong Chang, Shiguang Shan; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 4317-4327

Abstract


We propose a new task to benchmark human-in-scene understanding for embodied agents: Human-In-Scene Question Answering (HIS-QA). Given a human motion within a 3D scene, HIS-QA requires the agent to comprehend human states and behaviors, reason about the surrounding environment, and answer human-related questions within the scene. To support this new task, we present HIS-Bench, a multimodal benchmark that systematically evaluates HIS understanding across a broad spectrum, from basic perception to commonsense reasoning and planning. Our evaluation of various vision-language models on HIS-Bench reveals significant limitations in their ability to handle HIS-QA tasks. To address these limitations, we propose HIS-GPT, the first foundation model for HIS understanding. HIS-GPT integrates 3D scene context and human motion dynamics into large language models while incorporating specialized mechanisms to capture human-scene interactions. Extensive experiments demonstrate that HIS-GPT sets a new state of the art on HIS-QA tasks. We hope this work inspires future research on human behavior analysis in 3D scenes, advancing embodied AI and world models.

Related Material


@InProceedings{Zhao_2025_ICCV,
    author    = {Zhao, Jiahe and Hou, Ruibing and Tian, Zejie and Chang, Hong and Shan, Shiguang},
    title     = {HIS-GPT: Towards 3D Human-In-Scene Multimodal Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {4317-4327}
}