- [pdf] [code]
Cross-View Self-Fusion for Self-Supervised 3D Human Pose Estimation in the Wild
Human pose estimation methods have recently achieved remarkable results with supervised learning, which requires large amounts of labeled training data. However, such training data does not exist for many human activities, since 3D annotations are typically acquired with motion capture systems that require a controlled indoor environment. To address this issue, we propose a self-supervised approach that learns a monocular 3D human pose estimator from unlabeled multi-view images using multi-view consistency constraints. Furthermore, we refine inaccurate 2D poses, which adversely affect 3D pose predictions, by exploiting the properties of a canonical space, without relying on camera calibration. Because the multi-view information is leveraged without camera calibration, the network can be trained on footage captured in the wild. The key idea is to fuse the 2D observations across views and to combine the predictions derived from them so that multi-view consistency is satisfied during training. We outperform state-of-the-art self-supervised methods on the two benchmark datasets Human3.6M and MPI-INF-3DHP as well as on the in-the-wild dataset SkiPose.
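A minimal sketch of what a calibration-free multi-view consistency signal could look like (an illustrative assumption, not the paper's exact formulation): since no extrinsics are available, per-view 3D predictions can be compared up to a rigid transform, e.g. by orthogonal Procrustes alignment, and the residual after alignment penalized.

```python
import numpy as np

def procrustes_align(X, Y):
    """Rigidly align joint set Y (J, 3) onto X (J, 3) via SVD-based
    orthogonal Procrustes; returns the aligned copy of Y."""
    muX, muY = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - muX, Y - muY
    # Optimal rotation from the 3x3 cross-covariance matrix.
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    R = U @ Vt
    if np.linalg.det(R) < 0:      # guard against reflections
        U[:, -1] *= -1
        R = U @ Vt
    return (R @ Yc.T).T + muX

def multiview_consistency_loss(poses):
    """Hypothetical training signal: mean per-joint distance between each
    view's 3D prediction and the first view's, after rigid alignment."""
    ref = poses[0]
    residuals = [
        np.linalg.norm(procrustes_align(ref, p) - ref, axis=1).mean()
        for p in poses[1:]
    ]
    return float(np.mean(residuals))
```

If the per-view predictions describe the same underlying 3D pose, the loss vanishes regardless of camera placement, which is what makes such a constraint usable without calibration.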