Lifting Motion to the 3D World via 2D Diffusion

Li, Jiaman; Liu, C. Karen; Wu, Jiajun

Jiaman Li, C. Karen Liu, Jiajun Wu; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 17518-17528

Abstract

Estimating 3D motion from 2D observations is a long-standing research challenge. Prior work typically requires training on datasets containing ground-truth 3D motions, limiting its applicability to activities well-represented in existing motion capture data. This dependency particularly hinders generalization to out-of-distribution scenarios or subjects where collecting 3D ground truth is challenging, such as complex athletic movements or animal motion. We introduce MVLift, a novel approach to predict global 3D motion (including both joint rotations and root trajectories in the world coordinate system) using only 2D pose sequences for training. Our multi-stage framework leverages 2D motion diffusion models to progressively generate consistent 2D pose sequences across multiple views, a key step in recovering accurate global 3D motion. MVLift generalizes across various domains, including human poses, human-object interactions, and animal poses. Despite not requiring 3D supervision, it outperforms prior work on five datasets, including methods that require 3D supervision.

Related Material

[bibtex]

@InProceedings{Li_2025_CVPR,
    author    = {Li, Jiaman and Liu, C. Karen and Wu, Jiajun},
    title     = {Lifting Motion to the 3D World via 2D Diffusion},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {17518-17528}
}