MoMask: Generative Masked Modeling of 3D Human Motions

Submission ID: 8655

1. Motion Generation

1.1 Gallery

MoMask generates high-quality 3D human motions from diverse text inputs. Here, we show 15 distinct examples of generated motions, animated with various characters.

A character is running on a treadmill.

The person holds its left foot with its left hand, puts its right foot up and left hand up too.
A person stands for a few seconds and picks up its arms and shakes them.

This person kicks with their right leg then jabs several times.

A person walks with a limp, their left leg gets injured.

A person walks in a clockwise circle and stops where he began.

A man bends down and picks something up with his right hand.

The man walked forward, spun right on one foot and walked back to his original position.

A person stands, crosses left leg in front of the right, lowering themselves until they are sitting, both hands on the floor before standing and uncrossing legs.

A man is walking forward then steps over an object, then continues walking forward.

A person repeatedly blocks their face with their right arm.

This person takes 4 steps forward starting with their right foot.

The person takes 4 steps backwards.

The person did a kick spin to the left.

A figure stretches its hands and arms above its head.

1.2 Diverse Generation

MoMask also maintains diversity during generation, producing distinct motions for the same text prompt, as shown for the three prompts below.

A person jumps up and then lands.

The person was pushed but did not fall.

The person does a salsa dance.

2. Comparison

2.1 Comparison with Other Methods

We compare MoMask against three strong baselines, spanning diffusion models (MDM, MLD) and an autoregressive model (T2M-GPT). In contrast to these existing works, MoMask excels at capturing nuanced language concepts, resulting in more realistic generated motions.

This person stumbles left and right while moving forward.
Ours (MoMask) / MDM / T2M-GPT / MLD

A person has their forearms raised in front of them, then lowers them.
Ours (MoMask) / MDM / T2M-GPT / MLD

A person grabbed the leg and did something.
Ours (MoMask) / MDM / T2M-GPT / MLD

2.2 Impact of Residual Layers on Reconstruction

We investigate how the number of residual quantization layers affects reconstruction quality. The visual comparison presents the ground-truth motion alongside motions reconstructed by RVQ-VAEs with 5 residual layers, with 3 residual layers, and with no residual layers (a conventional VQ-VAE), respectively. The results demonstrate that residual quantization significantly reduces reconstruction errors, yielding high-fidelity motion tokenization.
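To clarify what the residual layers do, below is a minimal PyTorch sketch of residual vector quantization. It is illustrative rather than our actual implementation; the function name, the `codebooks` layout, and the tensor shapes are assumptions. Each layer quantizes the residual left over by the layers before it, so reconstruction error shrinks as layers are added.

```python
import torch

def residual_quantize(z, codebooks):
    """Residual vector quantization (illustrative sketch): layer l
    quantizes the residual left over by layers 0..l-1, so each
    additional layer contributes finer detail.

    z:         (T, D) latent features from the motion encoder.
    codebooks: list of (K, D) embedding tables; entry 0 is the base
               layer, the remaining entries are residual layers.
    """
    residual = z
    quantized = torch.zeros_like(z)
    tokens = []
    for codebook in codebooks:
        # Nearest codebook entry for the *current* residual.
        idx = torch.cdist(residual, codebook).argmin(dim=-1)   # (T,)
        selected = codebook[idx]                               # (T, D)
        quantized = quantized + selected
        residual = residual - selected   # leftover passes to the next layer
        tokens.append(idx)
    return tokens, quantized
```

With no residual codebooks, the loop runs once and the scheme degenerates to a single nearest-neighbor lookup, i.e., the conventional VQ-VAE baseline in the comparison above.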

2.3 Impact of Residual Tokens on Generation

Using the pre-trained RVQ model, we visually compare motions decoded from different combinations of tokens: base-layer tokens alone, base-layer tokens plus the first 3 residual-layer tokens, and base-layer tokens plus the first 5 residual-layer tokens. We observe that omitting residual tokens can cause subtle actions to be missed, as illustrated by the stumble in the example below; a minimal decoding sketch follows it.

A man walks forward, stumbles to the right, and then regains his balance and keeps walking forward.
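To make these token combinations concrete, here is a minimal sketch of decoding with a varying number of residual layers. It assumes the `tokens`/`codebooks` layout from the RVQ sketch in Section 2.2 and a hypothetical `decoder` network; these names are illustrative, not the actual implementation.

```python
def decode_with_residuals(tokens, codebooks, decoder, num_residual_layers):
    """Decode motion from base-layer tokens plus the first
    `num_residual_layers` residual-layer tokens.

    tokens[l]:    (T,) token indices emitted by quantization layer l.
    codebooks[l]: (K, D) embedding table of layer l.
    decoder:      network mapping (T, D) latents back to poses.
    """
    z = codebooks[0][tokens[0]]               # base layer: coarse motion
    for l in range(1, num_residual_layers + 1):
        z = z + codebooks[l][tokens[l]]       # residual layers: fine detail
    return decoder(z)
```

Setting `num_residual_layers = 0` reproduces the base-layer-only case, which is where subtle movements such as the stumble tend to be lost.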

3. Application: Temporal Inpainting

We showcase MoMask's capability to inpaint specified regions of existing motion clips, conditioned on a textual description. Below, we present inpainting results for the in-between (middle), prefix, and suffix regions of motion clips. The input motion is highlighted in purple, and the synthesized content is shown in cyan.
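Conceptually, each inpainting mode fixes which token positions the masked transformer must regenerate, while the remaining positions are kept from the input clip. Below is a minimal sketch; the token-level masking granularity and the one-third region boundaries are illustrative assumptions, not the actual configuration.

```python
import torch

def make_inpaint_mask(num_tokens, region):
    """Build a token mask for temporal inpainting (illustrative sketch).
    True  = position to be synthesized by the masked transformer (cyan),
    False = position kept from the input motion clip (purple).
    """
    mask = torch.zeros(num_tokens, dtype=torch.bool)
    third = num_tokens // 3  # illustrative boundary; any span works
    if region == "inbetween":   # keep both ends, fill the middle
        mask[third:2 * third] = True
    elif region == "prefix":    # synthesize the beginning
        mask[:third] = True
    elif region == "suffix":    # synthesize the end
        mask[2 * third:] = True
    return mask
```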

Inbetween (Purple=Input, Cyan=Synthesis)
+ "A person falls down and gets back up quickly."
+ "A person is pushed."

Prefix (Purple=Input, Cyan=Synthesis)
+ "A person gets up from the ground."
+ "A person is doing warm up."

Suffix (Purple=Input, Cyan=Synthesis)
+ "A person bows."
+ "A person squats."

4. Failure Cases

While MoMask demonstrates strong capabilities in generating 3D human motions from textual descriptions, it struggles with rare textual prompts and with actions involving fast root motion, such as spinning. The former might be mitigated by employing a large language model to simplify complex descriptions; we attribute the latter to limitations of the pose representation and to vector quantization errors.

A person is crouched and stands then resumes the crouched position.
A person does a spin dance.