We introduce Neural Riemannian Motion Fields (NRMF), a novel 3D generative human motion prior that enables robust, temporally consistent, and physically plausible 3D motion recovery. Unlike existing VAE or diffusion-based methods, our higher-order motion prior explicitly models human motion in the zero level set of a collection of neural distance fields (NDFs) corresponding to pose, transition (velocity), and acceleration dynamics. Our framework is rigorous in the sense that our NDFs are constructed on the product space of joint rotations, their angular velocities, and angular accelerations, respecting the geometry of the underlying articulations. We further introduce: (i) a novel adaptive-step hybrid algorithm for projecting onto the set of plausible motions, and (ii) a novel geometric integrator to "roll out" realistic motion trajectories during test-time optimization and generation. Our experiments show significant and consistent gains: trained on the AMASS dataset, NRMF generalizes remarkably well across multiple input modalities and to diverse tasks ranging from denoising to motion in-betweening and fitting to partial 2D/3D observations.
NRMF is a general-purpose, expressive, and robust unconditional motion prior. It models the space of plausible poses (\(\theta\)), transitions (\(\dot{\theta}\)), and accelerations (\(\ddot{\theta}\)) on the zero level set of a geometric neural distance field, which implicitly captures the data distribution. Poses are depicted alongside their transitions and accelerations: transitions are overlaid as blue dots on the per-joint distributions of learned transitions, and accelerations as blue rings around the magnitude distribution of all accelerations.
We develop projection (\(\Pi\)) and integration algorithms to deploy NRMF into several applications as shown: (i) motion denoising from noisy observations, (ii) motion estimation on in-the-wild videos, (iii) motion in-betweening, and (iv) motion generation.
NRMF learns to represent the space of realistic human motion by modeling the zero-level sets of three distinct yet related neural distance fields over {\(\theta\), \(\dot{\theta}\), \(\ddot{\theta}\)}. Each component is trained to predict the distance to the manifold of plausible motion states using motion capture data. The pose field learns which joint configurations are human-like, the transition field captures temporal consistency across frames, and the acceleration field enforces second-order realism by modeling smooth and plausible dynamics. These fields enable projection-based inference and allow NRMF to robustly reconstruct temporally consistent and physically plausible motion.
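The projection-based inference described above can be illustrated on a toy problem. The following is a minimal sketch, not the NRMF implementation: it assumes an analytic distance field (distance to the unit circle in the plane) as a stand-in for a learned NDF, and it uses plain Euclidean steps of the form \(x \leftarrow x - d(x)\,\nabla d(x)\) rather than the adaptive-step hybrid algorithm or any Riemannian operations. The names `distance`, `numerical_gradient`, and `project` are hypothetical.

```python
import numpy as np

def distance(x):
    """Toy stand-in for a learned distance field: unsigned distance to the unit circle."""
    return abs(np.linalg.norm(x) - 1.0)

def numerical_gradient(f, x, eps=1e-6):
    """Central finite-difference gradient (a trained network would use autograd instead)."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return grad

def project(x, f=distance, n_iters=50, tol=1e-8):
    """Move x toward the zero level set of f via x <- x - f(x) * grad f(x)."""
    x = np.asarray(x, dtype=float).copy()
    for _ in range(n_iters):
        d = f(x)
        if d < tol:  # close enough to the zero level set
            break
        x = x - d * numerical_gradient(f, x)
    return x

noisy = np.array([2.0, 1.0])   # a point off the "plausible" set
projected = project(noisy)     # lands (numerically) on the unit circle
```

For an exact distance field, a single step of this update already reaches the nearest point on the zero level set; a learned field is only approximately a distance, which is one reason an iterative scheme with step-size control is useful in practice.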
Our method recovers clean, plausible motion from noisy 3D observations, infills the missing motion of unobserved body parts, and in-betweens motion. The noisy inputs are simulated by adding Gaussian noise to the 3D observations.
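The way such inputs can be simulated is sketched below. This is an illustrative sketch under stated assumptions: the noise level `noise_std`, the joint count of 22, and the choice of hidden joints are placeholder values that are not taken from the experiments, and `corrupt_observations` is a hypothetical helper.

```python
import numpy as np

def corrupt_observations(joints, noise_std=0.02, visible_mask=None, seed=0):
    """Add i.i.d. Gaussian noise to 3D joints and optionally hide some joints.

    joints:       (T, J, 3) array of clean joint positions.
    noise_std:    noise standard deviation (placeholder value, same units as joints).
    visible_mask: optional (J,) boolean array; False marks joints treated as missing.
    """
    rng = np.random.default_rng(seed)
    noisy = joints + rng.normal(0.0, noise_std, size=joints.shape)
    if visible_mask is not None:
        hidden = ~np.asarray(visible_mask, dtype=bool)
        noisy[:, hidden, :] = np.nan  # NaN marks unobserved joints
    return noisy

clean = np.zeros((4, 22, 3))      # 4 frames, 22 joints (placeholder sizes)
mask = np.ones(22, dtype=bool)
mask[[20, 21]] = False            # pretend the last two joints are unobserved
noisy = corrupt_observations(clean, visible_mask=mask)
```

A fitting procedure would then compare its prediction only against the observed (non-NaN) entries and rely on the motion prior to fill in the rest.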
Our method can recover clean and plausible motion from in-the-wild RGB-D observations.
Results on PROX, EgoBody, 3DPW and in-the-wild videos.
Our method can in-between plausible motion from only partially given keyframes and can generate natural motion from initial poses, while remaining temporally consistent and physically plausible.
In-betweening on partial keyframes.
Generation from an initial pose (common standing pose).
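To make the idea of "rolling out" a motion from an initial pose concrete, here is a minimal sketch of a geometric integrator for a single joint rotation. It is not the integrator used by NRMF: it assumes a body-frame angular velocity, a constant angular acceleration supplied by the caller (in a full system the dynamics would come from the learned fields), and a simple first-order update \(R_{t+1} = R_t \exp(\Delta t\,[\omega_t]_\times)\). All function names are hypothetical.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_so3(w):
    """Rodrigues' formula: exponential map from an axis-angle vector to a rotation matrix."""
    angle = np.linalg.norm(w)
    K = hat(w)
    if angle < 1e-8:
        return np.eye(3) + K  # first-order approximation near zero
    return (np.eye(3)
            + (np.sin(angle) / angle) * K
            + ((1.0 - np.cos(angle)) / angle**2) * (K @ K))

def rollout(R0, omega0, alpha, dt, n_steps):
    """Integrate a rotation forward in time with constant angular acceleration alpha."""
    R = np.asarray(R0, dtype=float).copy()
    omega = np.asarray(omega0, dtype=float).copy()
    alpha = np.asarray(alpha, dtype=float)
    trajectory = [R]
    for _ in range(n_steps):
        R = R @ exp_so3(dt * omega)  # multiplicative update keeps R on SO(3)
        omega = omega + dt * alpha   # velocity lives in a vector space, so addition is fine
        trajectory.append(R)
    return np.stack(trajectory)

# Rotate about the z-axis at pi/2 rad/s for one second, starting from the identity.
traj = rollout(np.eye(3), omega0=[0.0, 0.0, np.pi / 2], alpha=np.zeros(3), dt=0.1, n_steps=10)
```

Updating the rotation through the exponential map, instead of adding increments to matrix entries or Euler angles, keeps every frame a valid rotation without re-normalization, which is the main motivation for integrating on the manifold.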