BibTeX
@InProceedings{Tseng_2025_WACV,
  author    = {Tseng, Kuan-Wei and Kawakami, Rei and Ikehata, Satoshi and Sato, Ikuro},
  title     = {CST: Character State Transformer for Object-Conditioned Human Motion Prediction},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
  month     = {February},
  year      = {2025},
  pages     = {63-72}
}
CST: Character State Transformer for Object-Conditioned Human Motion Prediction
Abstract
Modeling human-object interaction is a challenging task in human motion prediction due to limited data availability. Although pre-training on other large-scale datasets is an effective solution, it is not applicable to previous auto-regressive motion predictors because of their specialized setting. We propose a partially pre-trainable Character State Transformer (CST) for object-conditioned human motion prediction. As a crucial enhancement to the auto-regressive model, the CST consists of two main components: a task-agnostic Human Joint Transformer (HJT) and a task-dependent Spatiotemporal State Transformer (SST). The HJT extracts spatial features from the skeleton via an attention mechanism. The SST, conversely, handles multi-modal trajectory and goal-related data by combining self- and cross-attention mechanisms. Unlike previous methods, the HJT can be pre-trained on extensive motion datasets and subsequently fine-tuned for specific human-object interaction tasks. To this end, we propose a self-supervised masked reconstruction task for pre-training the HJT. Experimental results on the SAMP dataset show that our CST outperforms existing methods in motion prediction error while using fewer parameters. We will release the code upon acceptance.
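The masked-reconstruction idea behind the HJT pre-training can be sketched in a few lines: treat each skeleton joint as a token, zero out a random subset of joint features, run self-attention over the joint tokens, and penalize reconstruction error only on the masked joints. The sketch below is a minimal, hypothetical illustration in numpy (single head, no learned mask token, no training loop); the joint count, feature dimension, and mask ratio are illustrative assumptions, not details from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product attention over joint tokens.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)
    return attn @ v

rng = np.random.default_rng(0)
J, D = 22, 16                       # joints, feature dim (illustrative choices)
x = rng.normal(size=(J, D))         # per-joint pose features for one frame
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

# Mask a fixed-size random subset of joints; a real model would substitute
# a learned mask token, here we simply zero the features.
mask = np.zeros(J, dtype=bool)
mask[rng.choice(J, size=6, replace=False)] = True
x_masked = x.copy()
x_masked[mask] = 0.0

recon = self_attention(x_masked, Wq, Wk, Wv)
# Self-supervised objective: reconstruct only the masked joints.
loss = np.mean((recon[mask] - x[mask]) ** 2)
```

In practice the HJT would stack several such attention layers and be trained on large motion-capture corpora; after pre-training, its weights are fine-tuned jointly with the task-dependent SST.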