MoMask can generate high-quality 3D human motions across diverse text inputs. Here, we show 15 distinct examples of generated motions, animated with various characters.
Given the same text prompt, MoMask also generates a diverse set of motions rather than collapsing to a single fixed result.
We compare MoMask against three strong baselines: the diffusion models MDM and MLD, and the autoregressive model T2M-GPT. Compared with these existing works, MoMask captures nuanced language concepts more faithfully, resulting in more realistic generated motions.
We investigate how the number of residual quantization layers affects reconstruction quality. In the visual comparison, we present the ground-truth motion alongside motions recovered from RVQ-VAEs with 5 residual layers, 3 residual layers, and no residual layers (a conventional VQ-VAE), respectively. The results show that residual quantization substantially reduces reconstruction error, enabling high-fidelity motion tokenization.
Using the pre-trained RVQ model, we visually compare motions decoded from different combinations of tokens: the base-layer tokens alone, the base-layer tokens plus the first 3 residual-layer tokens, and the base-layer tokens plus the first 5 residual-layer tokens. Without the residual tokens, subtle actions may be lost, as illustrated by the stumble in this example.
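The mechanism behind these comparisons can be sketched in a few lines: each quantization layer encodes the residual left over by the previous layers, and decoding sums the code vectors from however many layers are kept. The codebook sizes, dimensions, and random data below are illustrative toys, not the paper's actual settings.

```python
import numpy as np

def nearest_code(x, codebook):
    # Index of the closest codebook entry for each row of x.
    dists = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def rvq_encode(x, codebooks):
    # Residual quantization: each layer quantizes what the previous layers missed.
    residual = x.copy()
    tokens = []
    for cb in codebooks:
        idx = nearest_code(residual, cb)
        tokens.append(idx)
        residual = residual - cb[idx]  # leftover error passes to the next layer
    return tokens

def rvq_decode(tokens, codebooks, num_layers):
    # Sum the code vectors from the base layer plus the first residual layers.
    return sum(cb[idx] for cb, idx in zip(codebooks[:num_layers], tokens[:num_layers]))

# Toy demo: reconstruction error shrinks as more residual layers are kept.
rng = np.random.default_rng(0)
dim, layers = 4, 6  # one base layer + 5 residual layers
# Each codebook includes a zero vector so a layer can leave the residual unchanged.
codebooks = [np.vstack([np.zeros((1, dim)), rng.normal(size=(31, dim)) * 0.5 ** i])
             for i in range(layers)]
x = rng.normal(size=(8, dim))
tokens = rvq_encode(x, codebooks)
# Base only, base + 3 residual layers, base + 5 residual layers:
errors = [np.linalg.norm(x - rvq_decode(tokens, codebooks, k)) for k in (1, 4, 6)]
```

The three `errors` values mirror the three decoded motions above: dropping residual layers leaves a strictly coarser approximation of the input.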
We showcase MoMask's ability to inpaint specific regions of existing motion clips, conditioned on a textual description. We present inpainting results for the middle, suffix, and prefix regions of motion clips, with the input motion highlighted in purple and the synthesized content in cyan.
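Conceptually, the region to inpaint can be described as a boolean mask over the token sequence, with the masked positions regenerated under the text condition while the rest of the clip is held fixed. The helper below is a hypothetical illustration of that interface, not part of the MoMask codebase.

```python
import numpy as np

def make_inpaint_mask(seq_len, region, frac=0.5):
    """Mark the token positions to regenerate (True); the rest of the
    clip stays fixed. `region` is 'prefix', 'middle', or 'suffix'.
    Illustrative helper only -- not MoMask's actual API."""
    mask = np.zeros(seq_len, dtype=bool)
    n = max(1, int(seq_len * frac))
    if region == "prefix":
        mask[:n] = True
    elif region == "suffix":
        mask[seq_len - n:] = True
    elif region == "middle":
        start = (seq_len - n) // 2
        mask[start:start + n] = True
    else:
        raise ValueError(f"unknown region: {region}")
    return mask

mid_mask = make_inpaint_mask(10, "middle")  # inpaint the central half of the clip
```

In the videos, purple frames correspond to the `False` (kept) positions and cyan frames to the `True` (synthesized) positions of such a mask.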
While MoMask demonstrates strong capabilities in generating 3D human motions from textual descriptions, it struggles with rare textual prompts and with actions involving fast root motion, such as spinning. The former might be mitigated by employing a larger language model to simplify complex descriptions; the latter stems from limitations of the pose representation and from vector quantization errors.