A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals

Jiangnan Tang, Jingya Wang, Kaiyang Ji, Lan Xu, Jingyi Yu, Ye Shi; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 21251-21262

Abstract


Estimating full-body human motion via sparse tracking signals from head-mounted displays and hand controllers in 3D scenes is crucial to applications in AR/VR. One of the biggest challenges to this task is the one-to-many mapping from sparse observations to dense full-body motions which endowed inherent ambiguities. To help resolve this ambiguous problem we introduce a new framework to combine rich contextual information provided by scenes to benefit full-body motion tracking from sparse observations. To estimate plausible human motions given sparse tracking signals and 3D scenes we develop \text S ^2Fusion a unified framework fusing \underline S cene and sparse \underline S ignals with a conditional dif\underline Fusion model. \text S ^2Fusion first extracts the spatial-temporal relations residing in the sparse signals via a periodic autoencoder and then produces time-alignment feature embedding as additional inputs. Subsequently by drawing initial noisy motion from a pre-trained prior \text S ^2Fusion utilizes conditional diffusion to fuse scene geometry and sparse tracking signals to generate full-body scene-aware motions. The sampling procedure of \text S ^2Fusion is further guided by a specially designed scene-penetration loss and phase-matching loss which effectively regularizes the motion of the lower body even in the absence of any tracking signals making the generated motion much more plausible and coherent. Extensive experimental results have demonstrated that our \text S ^2Fusion outperforms the state-of-the-art in terms of estimation quality and smoothness.

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Tang_2024_CVPR, author = {Tang, Jiangnan and Wang, Jingya and Ji, Kaiyang and Xu, Lan and Yu, Jingyi and Shi, Ye}, title = {A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {21251-21262} }