BibTeX:
@InProceedings{Wang_2025_WACV,
  author    = {Wang, Ruoyu and He, Yangfan and Sun, Tengjiao and Li, Xiang and Shi, Tianyu},
  title     = {UniTMGE: Uniform Text-Motion Generation and Editing Model via Diffusion},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {6104-6114}
}
UniTMGE: Uniform Text-Motion Generation and Editing Model via Diffusion
Abstract
Current methods have shown promising results in applying diffusion models to motion generation from text input. However, these methods are limited to unimodal inputs, produce outputs restricted to motion generation alone, and lack multimodal control capabilities. To address these issues, we introduce TMMGE, a text-motion multimodal generation and editing framework based on diffusion. TMMGE overcomes single-modality limitations, achieving strong performance and generalization across multiple tasks such as text-driven motion generation, motion captioning, motion completion, and multimodal motion editing. TMMGE comprises three components: UTMV, which maps text and motion into a shared latent space using contrastive learning; a controllable diffusion model customized for the UTMV space; and MCRE, which unifies multimodal conditions into CLIP representations, enabling precise multimodal control and flexible motion editing through simple linear operations. We conducted both closed-world and open-world experiments on the Motion-X dataset with detailed text descriptions; the results demonstrate our model's effectiveness and generalizability across multiple tasks.
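The abstract states that MCRE unifies multimodal conditions into CLIP representations so that motion editing reduces to simple linear operations in the shared space. The paper's exact formulation is not given here, so the following is only a minimal illustrative sketch of that idea, assuming unit-normalized embeddings and a hypothetical edit_motion_embedding helper; the actual UniTMGE/TMMGE procedure may differ.

import torch
import torch.nn.functional as F

def edit_motion_embedding(motion_emb: torch.Tensor,
                          src_text_emb: torch.Tensor,
                          tgt_text_emb: torch.Tensor,
                          strength: float = 1.0) -> torch.Tensor:
    """Hypothetical linear edit in a shared text-motion embedding space.

    Shifts the motion embedding along the direction separating the target
    description from the source description, analogous to embedding
    arithmetic in CLIP-like spaces.
    """
    edit_direction = tgt_text_emb - src_text_emb
    edited = motion_emb + strength * edit_direction
    # Re-normalize, assuming embeddings live on a unit hypersphere.
    return F.normalize(edited, dim=-1)

if __name__ == "__main__":
    dim = 512  # illustrative embedding dimension
    motion_emb = F.normalize(torch.randn(1, dim), dim=-1)     # embedding of an existing motion
    src_text_emb = F.normalize(torch.randn(1, dim), dim=-1)   # e.g. caption "a person walks"
    tgt_text_emb = F.normalize(torch.randn(1, dim), dim=-1)   # e.g. caption "a person runs"
    edited_emb = edit_motion_embedding(motion_emb, src_text_emb, tgt_text_emb)
    # In the described pipeline, the edited embedding would then condition
    # the diffusion model to synthesize the edited motion.
    print(edited_emb.shape)

In this sketch, the edited embedding is simply the motion embedding shifted by the difference of two text embeddings; any conditioning of the diffusion decoder on that result is assumed, not taken from the paper.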