ATM: Enhanced Alignment for Text-to-Motion Generation

Ke Han, Yueming Lyu, Weichen Yu, Nicu Sebe; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026, pp. 6862-6872

Abstract


Existing text-to-motion (T2M) generation methods primarily rely on regression-based objectives, such as minimizing positional errors. However, they lack effective semantic supervision and correction mechanisms, often leading to substantial misalignment between text and motion. To address this, we propose Aligned Text-to-Motion (ATM), a semantics-aware generation framework that automatically identifies and corrects text-motion misalignment. ATM incorporates two key components: (1) Inter-motion alignment, which detects semantic contradictions across motions and applies adaptive corrections based on the degree of semantic discrepancy, flexibly handling diverse misalignments and ensuring global text-motion consistency; (2) Intra-motion alignment, which refines locally missing or inaccurate motion semantics in an unsupervised manner by inferring semantic proxies, effectively addressing the absence of localized textual annotations. ATM is model-agnostic and can be seamlessly integrated into various T2M methods as a plug-and-play module. Extensive experiments on HumanML3D and KIT demonstrate that ATM consistently improves both generation quality and text-motion alignment. Code is available at https://github.com/ke-han-aca/ATM.git.
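The inter-motion alignment component, as described, scales its correction by the degree of semantic discrepancy between text and motion. A minimal sketch of that adaptive-weighting idea in plain Python, assuming text and motion embeddings are already available as vectors (the function names and the specific weighting formula are illustrative assumptions, not the paper's actual formulation):

```python
import math

def cosine_similarity(a, b):
    # Standard cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def adaptive_correction_weight(text_emb, motion_emb):
    """Hypothetical adaptive weight: a larger text-motion semantic
    discrepancy (lower cosine similarity) yields a stronger correction."""
    sim = cosine_similarity(text_emb, motion_emb)
    # Map similarity in [-1, 1] to a correction weight in [0, 1].
    return (1.0 - sim) / 2.0
```

Under this sketch, a perfectly aligned pair (identical embeddings) receives weight 0 and an opposing pair receives weight 1; in training, such a weight could scale a semantic alignment loss added alongside the base regression objective.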

Related Material


@InProceedings{Han_2026_WACV,
    author    = {Han, Ke and Lyu, Yueming and Yu, Weichen and Sebe, Nicu},
    title     = {ATM: Enhanced Alignment for Text-to-Motion Generation},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {March},
    year      = {2026},
    pages     = {6862-6872}
}