HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction

Wang, Yuan; Li, Yali; Li, Xiang; Wang, Shengjin

Yuan Wang, Yali Li, Xiang Li, Shengjin Wang; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 7147-7157

Abstract

While text-to-motion generation has advanced rapidly, synthesizing physically realistic, controllable, language-conditioned Human Scene Interactions (HSI) remains relatively underexplored. Current HSI methods naively rely on conditional Variational Autoencoders (cVAEs) and diffusion models. They typically support only limited modalities of control signals and use task-specific framework designs, which leads to inflexible adaptation across interaction scenarios and to motions that are unfaithful to their descriptions in diverse 3D physical environments. In this paper, we propose HSI-GPT, a General-Purpose Large Scene-Motion-Language Model that applies the "next-token prediction" paradigm of Large Language Models to the HSI domain. HSI-GPT not only flexibly accommodates diverse control signals (3D scenes, textual commands, key-frame poses, and scene affordances), but also seamlessly supports various HSI-related tasks (e.g., multi-modal controlled HSI generation, HSI understanding, and general motion completion in 3D scenes). First, HSI-GPT quantizes textual descriptions and human motions into discrete, LLM-interpretable tokens with multi-modal tokenizers. Inspired by multi-modal learning, we develop a recipe for aligning mixed-modality tokens into the shared embedding space of LLMs. These interaction tokens are then organized into unified instruction-following prompts, allowing HSI-GPT to be fine-tuned on question-and-answer tasks. Extensive experiments and visualizations validate that our general-purpose HSI-GPT model delivers exceptional performance across multiple HSI-related tasks.
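The tokenize-then-prompt idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain nearest-neighbour lookup against a fixed codebook as the quantizer, and the function names, the `<motion_k>` token format, and the prompt template are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each feature vector to the index of its nearest codebook entry.

    Assumption: squared Euclidean distance and a fixed codebook; a real
    tokenizer would learn the codebook jointly with an encoder.
    """
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(f, codebook[k]))
        for f in features
    ]


def motion_tokens_to_text(indices: Sequence[int]) -> str:
    """Render code indices as special tokens (hypothetical format)."""
    body = "".join(f"<motion_{i}>" for i in indices)
    return f"<motion>{body}</motion>"


def build_prompt(instruction: str, motion_indices: Sequence[int]) -> str:
    """Assemble an instruction-following prompt (hypothetical template)."""
    return (
        f"Instruction: {instruction}\n"
        f"Input: {motion_tokens_to_text(motion_indices)}\n"
        f"Response:"
    )


# Toy data: three 2-D "motion features" and a three-entry codebook.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

tokens = quantize(features, codebook)
prompt = build_prompt("Describe the interaction.", tokens)
print(tokens)
print(prompt)
```

Once motion is expressed as discrete indices, it can share a vocabulary with text, which is what lets a single next-token predictor handle both generation and understanding tasks through differently phrased prompts.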

Related Material

[bibtex]

@InProceedings{Wang_2025_CVPR,
    author    = {Wang, Yuan and Li, Yali and Li, Xiang and Wang, Shengjin},
    title     = {HSI-GPT: A General-Purpose Large Scene-Motion-Language Model for Human Scene Interaction},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {7147-7157}
}