Rethinking Diffusion for Text-Driven Human Motion Generation: Redundant Representations, Evaluation, and Masked Autoregression

Submission ID: 7630

1. Additional Comparisons

1.1 Additional Comparison with Baseline Methods

We show additional result comparisons between our method and three baseline methods: T2M-GPT (VQ-based), ReMoDiffuse (diffusion-based), and MoMask (VQ-based).
Only ReMoDiffuse incorporates an additional temporal filtering postprocess for smoother animation; for a fair comparison, we report ReMoDiffuse animations both from the raw generation and after postprocessing.
Our method generates motion that is more realistic and more accurately follows the fine details of the textual condition.
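The exact temporal filter used by ReMoDiffuse's postprocess is not specified here; a minimal sketch of one common choice, a centered moving-average filter applied independently to each pose feature channel (an assumed filter for illustration, not the authors' exact implementation):

```python
import numpy as np

def temporal_filter(motion: np.ndarray, window: int = 5) -> np.ndarray:
    """Smooth a motion sequence with a centered moving average.

    motion: (num_frames, num_features) array of pose features per frame.
    window: odd window size; larger values give smoother but laggier motion.
    """
    assert window % 2 == 1, "window must be odd for a centered filter"
    half = window // 2
    # Pad by repeating boundary frames so output length matches input.
    padded = np.pad(motion, ((half, half), (0, 0)), mode="edge")
    kernel = np.ones(window) / window
    # Convolve each feature channel independently along the time axis.
    smoothed = np.stack(
        [np.convolve(padded[:, j], kernel, mode="valid")
         for j in range(motion.shape[1])],
        axis=1,
    )
    return smoothed
```

Such filtering trades high-frequency detail (e.g., sharp foot contacts) for reduced jitter, which is why the raw generation is reported alongside the filtered one.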







A man steps forward, swings his leg, and turns all the way around.
(Results shown for: Ours, T2M-GPT, ReMoDiffuse with temporal filter postprocess, ReMoDiffuse, MoMask)







A person doing a forward kick with each leg.
(Results shown for: Ours, T2M-GPT, ReMoDiffuse with temporal filter postprocess, ReMoDiffuse, MoMask)







A man walks forward and then trips towards the right.
(Results shown for: Ours, T2M-GPT, ReMoDiffuse with temporal filter postprocess, ReMoDiffuse, MoMask)







A person walks forward, stepping up with their right leg and down with their left, then turns to their left and walks, then turns to their left and starts stepping up.
(Results shown for: Ours, T2M-GPT, ReMoDiffuse with temporal filter postprocess, ReMoDiffuse, MoMask)







A person fastly swimming forward.
(Results shown for: Ours, T2M-GPT, ReMoDiffuse with temporal filter postprocess, ReMoDiffuse, MoMask)

1.2 Ablation Study

We investigate the impact of two major components of our method: motion representation reformation and autoregressive modeling. The results demonstrate that each component contributes significantly to the generation process, leading to high-quality motion generation.

A person walks in a circular counterclockwise direction one time before returning back to his/her original position.

Full Method
W/o Autoregressive Modeling (does not follow the textual instruction to return to the original position)
W/o Motion Representation Reformation (shaking and inaccurate motion)

2. Temporal Editing

Our method can be applied beyond the scope of standard text-to-motion generation to temporal editing. Here we present temporal editing results (prefix, in-between, suffix) using our method. The input motion clips are rendered without coloring (grayscale) and the edited content is shown in full color.
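The three editing modes differ only in which frames are regenerated under the new text prompt and which are kept from the input clip. A minimal sketch of how such frame masks might be constructed (the `mode` interface and centered placement for in-between editing are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np

def make_edit_mask(num_frames: int, mode: str, edit_len: int) -> np.ndarray:
    """Return a boolean mask over frames: True = regenerate, False = keep input.

    mode: 'prefix' regenerates the first edit_len frames,
          'suffix' the last edit_len frames,
          'in_between' a centered span of edit_len frames.
    """
    mask = np.zeros(num_frames, dtype=bool)
    if mode == "prefix":
        mask[:edit_len] = True
    elif mode == "suffix":
        mask[-edit_len:] = True
    elif mode == "in_between":
        start = (num_frames - edit_len) // 2
        mask[start:start + edit_len] = True
    else:
        raise ValueError(f"unknown mode: {mode}")
    return mask
```

During generation, masked frames are sampled under the editing prompt while unmasked frames are clamped to the input motion, so the model must produce transitions that stay consistent with the fixed context.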










Prefix (no color = input, full color = edited)

Original: "A person walks in a circular counterclockwise direction one time before returning back to his/her original position."
+ Prefix: "A person dances around."

Original: "The man takes 4 steps backwards."
+ Prefix: "The man waves both hands."










In-Between (no color = input, full color = edited)

Original: "The person fell down and is crawling away from someone."
+ In-Between: "The person jumps up and down."

Original: "A person walks in a curved line."
+ In-Between: "The person takes a small jump."










Suffix (no color = input, full color = edited)

Original: "A person is walking across a narrow beam."
+ Suffix: "A person raises his hands."

Original: "A man rises from the ground, walks in a circle and sits back down on the ground."
+ Suffix: "A man starts to run."

3. Additional Motion Generation Visualizations

3.1 Generation Diversity

Our method can generate diverse motions while maintaining high quality.

The person was pushed but did not fall.


A person walks around.


A person jumps up and then lands.

3.2 Additional Visualization Gallery

Our method is capable of generating high-quality, instruction-following 3D human motions. We include 15 additional distinct motion examples generated by our method.

A person waves with both arms above head.

A man slowing walking forward.
The toon is standing, swaying a bit, then raising their left wrist as to check the time on a watch.

A man walks forward before stumbling backwards and the continues walking forward.

The person fell down and is crawling away from someone.

The sim reaches to their left and right, grabbing an object and appearing to clean it.



The man takes 4 steps backwards.


She jumps up and down, kicking her heels in the air.

A person who is standing lifts his hands and claps them four times.

A person who is running, stops, bends over and looks down while taking small steps, then resumes running.

A person walks slowly forward holding handrail with left hand.

The person kick his left foot up and both hands up in counterclockwise circle and stop.


A person steps to the left sideways.

A person is walking across a narrow beam.

A person does a drumming movement with both hands.

3.3 Failure Cases

Our method demonstrates strong text-to-motion generation capability. However, as shown in the failure cases below, following lengthy textual instructions and handling diffusion's improvisational nature still pose challenges. The lengthy-instruction problem might be eased by employing a more advanced text encoder (for example, a large language model) to provide more detailed condition vectors. The improvisation problem can be addressed by carefully choosing the classifier-free guidance level in different scenarios.

(Lengthy textual instructions -> loss of some small details)
A person bends at the waist and seems to pick up something with their right hand, then walks forward and places it on something in front of them, then walks back.
(Diffusion's improvisational nature -> after generating the text-required poses, the model begins to improvise at the very end.)
A figure puts their hands into a praying motion then out his arms back at his side.
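Classifier-free guidance controls how strictly generation follows the text by extrapolating from the unconditional prediction toward the conditional one. A minimal sketch of the standard combination rule (the `model` call signature and guidance scale `w` are illustrative; the source does not specify the exact interface):

```python
import numpy as np

def cfg_predict(model, x_t, t, text_emb, w: float):
    """Classifier-free guidance at one denoising step.

    w = 1 recovers the purely conditional prediction; larger w follows the
    text more strictly (less improvisation) at some cost in naturalness.
    """
    eps_uncond = model(x_t, t, cond=None)      # null-text prediction
    eps_cond = model(x_t, t, cond=text_emb)    # text-conditioned prediction
    # Extrapolate away from the unconditional prediction.
    return eps_uncond + w * (eps_cond - eps_uncond)
```

Choosing `w` per scenario, higher for long, detail-heavy prompts and lower when some improvisation is acceptable, is one way to trade off the two failure modes above.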