We show additional result comparisons between our method and three baseline methods: T2M-GPT (VQ-based), ReMoDiffuse (diffusion-based), and MoMask (VQ-based).
Among the baselines, only ReMoDiffuse applies an additional temporal filtering post-process for smoother animation; for a fair comparison, we report ReMoDiffuse results both from the raw generation and after post-processing.
Our method generates motion that is more realistic and more accurately follows the fine details of the textual condition.
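For reference, temporal filtering of this kind is typically a light smoothing pass over the generated joint trajectories. The snippet below is a minimal sketch of such a post-process using a Savitzky-Golay filter; the filter choice, window size, and (T, J, 3) motion layout are our own assumptions for illustration, not a description of ReMoDiffuse's actual post-process.

```python
# Illustrative temporal smoothing post-process (NOT ReMoDiffuse's exact filter).
# Assumes a generated motion of shape (T, J, 3): T frames, J joints, xyz positions.
import numpy as np
from scipy.signal import savgol_filter

def smooth_motion(motion: np.ndarray, window: int = 9, polyorder: int = 2) -> np.ndarray:
    """Apply a Savitzky-Golay filter along the time axis of a (T, J, 3) motion clip."""
    if motion.shape[0] < window:  # too few frames to filter; return unchanged
        return motion
    return savgol_filter(motion, window_length=window, polyorder=polyorder, axis=0)

# Example usage on a random stand-in clip: 120 frames, 22 joints (e.g., a HumanML3D-style skeleton).
raw = np.random.randn(120, 22, 3)
smoothed = smooth_motion(raw)
```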
We investigate the impact of two major components of our method: motion representation reformation and autoregressive modeling.
The results demonstrate that each component contributes significantly to the generation process, leading to high-quality motion generation.
Our method can be applied beyond standard text-to-motion generation to temporal editing. Here we present temporal editing results (prefix, in-between, and suffix) produced with our method. The input motion clips are shown in grayscale and the edited content is shown in full color; a sketch of the corresponding frame masks is given below.
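The sketch below expresses the three editing modes as a per-frame boolean mask over the input clip. The mask construction, the `edit_len` parameter, and the final where-style blend are illustrative assumptions only and do not reproduce our model's exact conditioning mechanism.

```python
# Hedged sketch of temporal-editing masks (prefix, in-between, suffix).
import torch

def build_edit_mask(num_frames: int, mode: str, edit_len: int = 40) -> torch.Tensor:
    """Boolean mask over frames: True = regenerate from the text prompt, False = keep input."""
    mask = torch.zeros(num_frames, dtype=torch.bool)
    if mode == "prefix":        # regenerate the opening frames, keep the rest
        mask[:edit_len] = True
    elif mode == "suffix":      # regenerate the closing frames, keep the rest
        mask[-edit_len:] = True
    elif mode == "in-between":  # regenerate a middle segment, keep both ends
        start = (num_frames - edit_len) // 2
        mask[start:start + edit_len] = True
    else:
        raise ValueError(f"unknown editing mode: {mode}")
    return mask

# The edited clip mixes kept frames with generated ones, e.g.:
# edited = torch.where(mask[:, None], generated_motion, input_motion)
```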
Our method can generate diverse motions while maintaining high quality.
Our method is capable of generating high-quality 3D human motions that follow the textual instructions.
We include 15 additional distinct motion examples generated by our method.
Our method demonstrates strong text-to-motion generation capability. However, as shown in the failure cases below, following lengthy textual instructions and the improvisational nature of diffusion sampling still pose challenges.
The difficulty with lengthy textual instructions might be eased by employing a more advanced text encoder (for example, a large language model) to provide more detailed condition vectors.
The improvisation issue of diffusion sampling can be mitigated by carefully choosing the classifier-free guidance scale for different scenarios.
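For context, classifier-free guidance blends a text-conditioned and an unconditional prediction at each denoising step, and the guidance scale controls how strictly the sample follows the text at the cost of diversity. The sketch below shows the standard combination rule; the `model` interface and argument names are placeholders, not our actual implementation.

```python
# Standard classifier-free guidance combination; model interface is a placeholder.
import torch

def cfg_predict(model, x_t, t, text_emb, null_emb, guidance_scale: float) -> torch.Tensor:
    """Combine conditional and unconditional predictions at one denoising step.
    guidance_scale = 1.0 reproduces the conditional model; larger values follow the
    text more strictly and reduce the model's 'improvisation'."""
    eps_cond = model(x_t, t, text_emb)    # text-conditioned prediction
    eps_uncond = model(x_t, t, null_emb)  # unconditional (null-text) prediction
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```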