Monocular 3D Human Pose Estimation by Predicting Depth on Joints

Bruce Xiaohan Nie, Ping Wei, Song-Chun Zhu; Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3447-3455

Abstract


This paper aims at estimating full-body 3D human poses from monocular images, for which the biggest challenge is the inherent ambiguity introduced by lifting the 2D pose into 3D space. We propose a novel framework that reduces this ambiguity by predicting the depth of human joints from 2D joint locations and body part images. Our approach is built on a two-level hierarchy of Long Short-Term Memory (LSTM) networks that can be trained end-to-end. The first level consists of two components: 1) a skeleton-LSTM which learns depth information from global human skeleton features; 2) a patch-LSTM which exploits the local image evidence around joint locations. Both networks have a tree structure defined on the kinematic relations of the human skeleton, so information at different joints is broadcast through the whole skeleton in a top-down fashion. The two networks are first pre-trained separately on different data sources and then aggregated at the second level for the final depth prediction. Empirical evaluation on the Human3.6M and HHOI datasets demonstrates the advantage of combining the global 2D skeleton with local image patches for depth prediction, as well as superior quantitative and qualitative performance relative to state-of-the-art methods.
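To make the two-level hierarchy concrete, below is a minimal sketch (not the authors' released code) of how a tree-structured skeleton-LSTM and patch-LSTM could be fused to regress a depth value per joint, assuming PyTorch. Names such as SkeletonPatchDepthNet, KINEMATIC_PARENT, patch_feat_dim, and the joint ordering are hypothetical placeholders for illustration only.

```python
# Hypothetical sketch of the two-level LSTM hierarchy described in the abstract.
import torch
import torch.nn as nn

# Illustrative parent index for each of 17 joints (root has parent -1);
# the actual kinematic tree used in the paper may differ.
KINEMATIC_PARENT = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 8, 10, 11, 8, 13, 14, 8]

class SkeletonPatchDepthNet(nn.Module):
    def __init__(self, patch_feat_dim=256, hidden=128):
        super().__init__()
        # First level: skeleton-LSTM over 2D joint coordinates and
        # patch-LSTM over per-joint image patch features.
        self.skeleton_lstm = nn.LSTMCell(2, hidden)
        self.patch_lstm = nn.LSTMCell(patch_feat_dim, hidden)
        # Second level: aggregate both hidden states and predict one depth per joint.
        self.fuse_lstm = nn.LSTMCell(2 * hidden, hidden)
        self.depth_head = nn.Linear(hidden, 1)
        self.hidden = hidden

    def forward(self, joints_2d, patch_feats):
        # joints_2d:   (batch, J, 2) normalized 2D joint locations
        # patch_feats: (batch, J, patch_feat_dim) features of patches around joints
        B, J, _ = joints_2d.shape
        zeros = lambda: (joints_2d.new_zeros(B, self.hidden),
                         joints_2d.new_zeros(B, self.hidden))
        skel_state, patch_state, fuse_state = [None] * J, [None] * J, [None] * J
        depths = []
        # Top-down pass: each joint receives the state of its kinematic parent,
        # so information is broadcast from the root outward along the skeleton.
        for j in range(J):
            p = KINEMATIC_PARENT[j]
            skel_prev = zeros() if p < 0 else skel_state[p]
            patch_prev = zeros() if p < 0 else patch_state[p]
            fuse_prev = zeros() if p < 0 else fuse_state[p]
            skel_state[j] = self.skeleton_lstm(joints_2d[:, j], skel_prev)
            patch_state[j] = self.patch_lstm(patch_feats[:, j], patch_prev)
            fused_in = torch.cat([skel_state[j][0], patch_state[j][0]], dim=1)
            fuse_state[j] = self.fuse_lstm(fused_in, fuse_prev)
            depths.append(self.depth_head(fuse_state[j][0]))
        return torch.cat(depths, dim=1)  # (batch, J) predicted depth per joint

# Usage sketch: predicted depths would be combined with the 2D joints to lift to 3D.
model = SkeletonPatchDepthNet()
z = model(torch.randn(4, 17, 2), torch.randn(4, 17, 256))
```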

Related Material


[pdf]
[bibtex]
@InProceedings{Nie_2017_ICCV,
author = {Xiaohan Nie, Bruce and Wei, Ping and Zhu, Song-Chun},
title = {Monocular 3D Human Pose Estimation by Predicting Depth on Joints},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
month = {Oct},
year = {2017}
}