Multi-level Causal LLM-based Text-to-Motion Generation with Human Alignment


Xiaodong Chen, Qian Bao, Xudong Liu, Jianping Fang, Jintao Fang, Yongdong Zhang, Tao Mei, Wu Liu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 9342-9351

Abstract

Although LLM-based text-driven motion generation has progressed, it still struggles to generate fine-grained, semantically consistent motions. These limitations stem from: 1) fine-grained motion quantization errors; 2) a mismatch between causally modeled language and non-causal motion representations; and 3) a lack of human preference alignment. To address them, this paper proposes MoTiGA, a multi-level causal LLM-based text-to-motion generation framework with human alignment. First, MoTiGA employs a Causal RVQ-VAE for multi-level, causal, fine-grained motion representation, using iterative residual quantization and causal convolutions to reduce fine-grained motion quantization errors while preserving the causality found in language representation. Second, the framework incorporates a time-lagged causal prediction strategy that enables parallel prediction across motion token levels while maintaining temporal dependencies. Finally, to enhance human alignment, we propose Multi-level Hybrid-weighted Preference Optimization (MHPO), which dynamically adjusts semantic similarity weighting and continuous similarity scores. To support MHPO, we also release the HumanML3D-R dataset, the first large-scale preference dataset for motion generation, with 101,490 human preference pairs. Evaluations show that MoTiGA outperforms other LLM-based methods, with an 82.3% FID improvement on HumanML3D and a 64.7% improvement on KIT-ML.
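Two ideas named in the abstract, iterative residual quantization and time-lagged prediction across token levels, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`residual_quantize`, `time_lag`), the toy codebooks, and the padding value are all hypothetical, and the time-lag layout shown is a generic delay pattern (level k shifted right by k steps) that is assumed here for illustration rather than taken from the paper.

```python
from typing import List, Sequence, Tuple


def nearest_code(x: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to x (squared Euclidean distance)."""
    def sq_dist(code: Sequence[float]) -> float:
        return sum((a - b) ** 2 for a, b in zip(x, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def residual_quantize(
    x: Sequence[float], codebooks: Sequence[Sequence[Sequence[float]]]
) -> Tuple[List[int], List[float]]:
    """Quantize x level by level; each level encodes the residual left by earlier levels."""
    residual = list(x)
    recon = [0.0] * len(x)
    tokens: List[int] = []
    for codebook in codebooks:
        idx = nearest_code(residual, codebook)
        code = codebook[idx]
        tokens.append(idx)
        recon = [r + c for r, c in zip(recon, code)]        # running reconstruction
        residual = [r - c for r, c in zip(residual, code)]  # what is still unexplained
    return tokens, recon


def time_lag(tokens_per_level: Sequence[Sequence[int]], pad: int = -1) -> List[List[int]]:
    """Shift level k right by k steps (assumed delay pattern), padding the gaps.

    Reading the result column by column, one decoding step emits one token per
    level, and a level-k token for frame t appears only after the level-(k-1)
    token for the same frame.
    """
    num_levels = len(tokens_per_level)
    length = len(tokens_per_level[0])
    total = length + num_levels - 1
    return [
        [pad] * k + list(level) + [pad] * (total - length - k)
        for k, level in enumerate(tokens_per_level)
    ]


# Toy two-level codebooks for 2-D vectors (hypothetical values).
CODEBOOKS = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],     # coarse level
    [[0.0, 0.0], [0.25, 0.0], [0.0, -0.25]],  # fine level, refines the residual
]

tokens, recon = residual_quantize([1.2, 0.9], CODEBOOKS)
lagged = time_lag([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
```

In this toy setting the second level lowers the squared reconstruction error relative to the first level alone, which is the sense in which adding residual levels reduces quantization error. The lagged layout adds `num_levels - 1` extra decoding steps in exchange for predicting all levels in parallel at each step.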

Related Material



@InProceedings{Chen_2026_CVPR,
    author    = {Chen, Xiaodong and Bao, Qian and Liu, Xudong and Fang, Jianping and Fang, Jintao and Zhang, Yongdong and Mei, Tao and Liu, Wu},
    title     = {Multi-level Causal LLM-based Text-to-Motion Generation with Human Alignment},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {9342-9351}
}