UniMotion: Bridging 2D and 3D Representations for Human Motion Prediction
Abstract
Human motion prediction (HMP) forecasts future human motion (pose sequences) based on previous motion data. While existing methods excel by learning motion dynamics from adjacent 3D skeleton poses, they face a significant challenge: the reliance on large-scale 3D pose annotations, which are costly to produce and often fail to capture the full diversity of actions and scenarios. This issue is particularly pronounced for underrepresented groups such as the elderly, for whom annotated 3D data is even scarcer. To address this challenge, we propose leveraging more readily available 2D annotations to complement the limited 3D data for HMP. In this work, we introduce UniMotion, a unified system for HMP capable of predicting both 2D and 3D future human pose sequences from 2D and/or 3D previous pose sequences. The main advantage of UniMotion over previous HMP systems is that it requires much less 3D training data, achieving remarkable accuracy even when trained with only a small portion of 3D data. To train UniMotion with unpaired 2D and 3D pose sequences, we introduce a novel sequential bidirectional knowledge distillation module (SeqBi), which enables mutual learning between the 2D and 3D encoders. To tackle the data-imbalance challenge, we increase the diversity of the underrepresented 3D data by adding a small perturbation to the joint angles at the sequence level (RegPer). Extensive experiments on public datasets, including general adult datasets (H3.6M, 3DPW) and an elderly-specific dataset (TST), demonstrate that UniMotion achieves results comparable to or better than state-of-the-art methods while requiring only one-third of the 3D training data.
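The abstract names two components, SeqBi and RegPer, without implementation detail. The sketch below is a minimal illustration of the general ideas they describe (bidirectional feature distillation between the 2D and 3D encoders, and a sequence-level joint-angle perturbation), assuming a PyTorch pipeline with pose tensors shaped (batch, time, joints, dims). The function names, tensor shapes, loss choices, and the noise scale sigma are illustrative assumptions, not the authors' actual implementation; in particular, how features from unpaired 2D and 3D sequences are aligned before distillation is not specified in the abstract and is assumed here to be handled upstream.

```python
# Illustrative sketch only; names and shapes are assumptions, not the paper's code.
import torch
import torch.nn.functional as F


def seqbi_loss(feat_2d: torch.Tensor, feat_3d: torch.Tensor) -> torch.Tensor:
    """Bidirectional knowledge-distillation sketch in the spirit of SeqBi.

    Each encoder's per-frame features act as a detached soft target for the
    other branch, so the 2D and 3D encoders teach each other.
    feat_2d, feat_3d: (batch, time, channels) sequence features, assumed aligned.
    """
    loss_2d_from_3d = F.mse_loss(feat_2d, feat_3d.detach())  # 3D teaches 2D
    loss_3d_from_2d = F.mse_loss(feat_3d, feat_2d.detach())  # 2D teaches 3D
    return loss_2d_from_3d + loss_3d_from_2d


def regper_augment(joint_angles: torch.Tensor, sigma: float = 0.01) -> torch.Tensor:
    """Sequence-level joint-angle perturbation sketch in the spirit of RegPer.

    One small Gaussian offset is drawn per sequence (not per frame) and added
    to the joint angles, increasing the diversity of scarce 3D sequences while
    keeping the motion dynamics within each sequence temporally consistent.
    joint_angles: (batch, time, joints, 3) axis-angle rotations in radians.
    """
    batch, _, joints, dims = joint_angles.shape
    # Broadcast a single perturbation over the time axis of each sequence.
    noise = sigma * torch.randn(batch, 1, joints, dims, device=joint_angles.device)
    return joint_angles + noise
```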
Related Material
[pdf]
[bibtex]
@InProceedings{Zhu_2025_WACV,
    author    = {Zhu, Yanjun and Bai, Chen and Lu, Cheng and Doermann, David and Lapedriza, Agata},
    title     = {UniMotion: Bridging 2D and 3D Representations for Human Motion Prediction},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {February},
    year      = {2025},
    pages     = {52-62}
}