Causal Motion Tokenizer for Streaming Motion Generation
Biao Jiang, Xin Chen, Ailing Zeng, Xinru Sun, Fukun Yin, Xianfang Zeng, Xuanyang Zhang, Gang Yu, Tao Chen; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 2024-2034
Abstract
Recent advances in human motion generation have leveraged various multimodal inputs, including text, music, and audio. Despite significant progress, generating human motion in a streaming setting, particularly from text, remains underexplored. Traditional streaming methods rely on temporal modalities, leaving text-based motion generation with limited capabilities, especially for seamless transitions and low latency. In this work, we introduce MotionStream, a pioneering motion-streaming pipeline designed to continuously generate human motion sequences that adhere to the semantic constraints of the input text. Our approach uses a Causal Motion Tokenizer, built on a residual vector-quantized variational autoencoder (RVQ-VAE) with causal convolutions, to handle long sequences and ensure smooth transitions between motion segments. We then employ a Masked Transformer and a Residual Transformer to generate motion tokens efficiently. Extensive experiments show that MotionStream not only achieves state-of-the-art performance in motion composition but also sustains real-time generation with significantly reduced latency. We highlight the versatility of MotionStream through a story-to-motion application, demonstrating its potential for robotic control, animation, and gaming.
Related Material
[pdf] [supp]
[bibtex]
@InProceedings{Jiang_2025_ICCV,
    author    = {Jiang, Biao and Chen, Xin and Zeng, Ailing and Sun, Xinru and Yin, Fukun and Zeng, Xianfang and Zhang, Xuanyang and Yu, Gang and Chen, Tao},
    title     = {Causal Motion Tokenizer for Streaming Motion Generation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {2024-2034}
}