@InProceedings{Milacski_2025_WACV,
  author    = {Milacski, Zolt\'an \'A. and Niinuma, Koichiro and Kawamura, Ryosuke and de la Torre, Fernando and Jeni, L\'aszl\'o A.},
  title     = {GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {4108-4118}
}
GHOST: Grounded Human Motion Generation with Open Vocabulary Scene-and-Text Contexts
Abstract
The connection between our 3D surroundings and the descriptive language that characterizes them would be well-suited for localizing and generating human motion in context, but for one problem: the complexity introduced by multiple modalities makes capturing this connection challenging with a fixed set of descriptors. Specifically, closed vocabulary scene encoders, which require learning text-scene associations from scratch, have been favored in the literature, often resulting in inaccurate motion grounding. In this paper, we propose a method that integrates an open vocabulary scene encoder into the architecture, establishing a robust connection between text and scene. Our two-step approach starts with pretraining the scene encoder through knowledge distillation from an existing open vocabulary semantic image segmentation model, ensuring a shared text-scene feature space. Subsequently, the scene encoder is fine-tuned for conditional motion generation, incorporating two novel regularization losses that regress the category and size of the goal object. Our methodology achieves up to a 30% reduction in the goal object distance metric compared to the prior state-of-the-art baseline model on the HUMANISE dataset. This improvement is demonstrated through evaluations conducted using three implementations of our framework, a perceptual study, and an open vocabulary experiment. Additionally, our method is designed to accommodate future 2D open vocabulary segmentation methods for distillation in a plug-and-play manner.
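The two-step recipe in the abstract (distillation into a shared text-scene feature space, then fine-tuning with category and size regularizers) can be sketched in PyTorch. This is a minimal illustration, not the authors' implementation: the cosine-alignment form of the distillation loss, the cross-entropy category head, and the L1 size head are all assumptions about plausible loss choices, and every tensor name below is hypothetical.

```python
import torch
import torch.nn.functional as F


def distillation_loss(scene_feats: torch.Tensor,
                      teacher_feats: torch.Tensor) -> torch.Tensor:
    """Step 1 (assumed form): align per-point 3D scene-encoder features
    with features distilled from a 2D open vocabulary segmentation
    teacher, so both live in the same text-aligned feature space."""
    return (1.0 - F.cosine_similarity(scene_feats, teacher_feats, dim=-1)).mean()


def regularization_losses(category_logits: torch.Tensor,
                          gt_category: torch.Tensor,
                          pred_size: torch.Tensor,
                          gt_size: torch.Tensor):
    """Step 2 (assumed form): two auxiliary heads regularize fine-tuning
    by predicting the goal object's category and its 3D size."""
    l_category = F.cross_entropy(category_logits, gt_category)
    l_size = F.l1_loss(pred_size, gt_size)
    return l_category, l_size


# Toy usage with random tensors standing in for real features/labels.
student = torch.randn(8, 512)            # scene-encoder features
teacher = torch.randn(8, 512)            # 2D teacher features (precomputed)
l_distill = distillation_loss(student, teacher)

logits = torch.randn(8, 20)              # 20 hypothetical object categories
labels = torch.randint(0, 20, (8,))
size_pred = torch.rand(8, 3)             # predicted goal-object extents
size_gt = torch.rand(8, 3)
l_cat, l_size = regularization_losses(logits, labels, size_pred, size_gt)
```

The teacher is frozen during distillation, which is what lets future 2D open vocabulary segmentation models be swapped in plug-and-play: only `teacher` changes, not the loss.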