Self-Supervised Human Depth Estimation From Monocular Videos

Feitong Tan, Hao Zhu, Zhaopeng Cui, Siyu Zhu, Marc Pollefeys, Ping Tan; The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 650-659

Abstract


Previous methods for estimating detailed human depth often require supervised training with 'ground truth' depth data. This paper presents a self-supervised method that can be trained on YouTube videos without known depth, which simplifies training data collection and improves the generalization of the learned network. Self-supervised learning is achieved by minimizing a photo-consistency loss, evaluated between a video frame and its neighboring frames warped according to the estimated depth and the 3D non-rigid motion of the human body. To compute this non-rigid motion, we first estimate a rough SMPL model at each video frame and derive the non-rigid body motion accordingly, which enables self-supervised learning of the shape details. Experiments demonstrate that our method generalizes better and performs much better on data in the wild.
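The core supervision signal described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes the per-pixel correspondences (each pixel's projected location in the neighboring frame, which in the paper come from the estimated depth and the SMPL-derived non-rigid motion) are already given as a `coords` array, and it uses nearest-pixel sampling and a plain L1 penalty for brevity.

```python
import numpy as np

def photo_consistency_loss(frame, neighbor, coords):
    """Mean L1 photo-consistency between `frame` and `neighbor`.

    frame, neighbor: (H, W) grayscale images.
    coords: (H, W, 2) integer (row, col) locations in `neighbor` for each
        pixel of `frame` (hypothetically produced by projecting each pixel
        with the estimated depth and non-rigid body motion).
    """
    H, W = frame.shape
    rows = np.clip(coords[..., 0], 0, H - 1)   # stay inside the image
    cols = np.clip(coords[..., 1], 0, W - 1)
    warped = neighbor[rows, cols]              # neighbor warped into frame's view
    return np.abs(frame - warped).mean()       # mean L1 photometric error

# With identity correspondences and identical frames, the loss is zero.
frame = np.arange(16, dtype=float).reshape(4, 4)
coords = np.stack(np.meshgrid(np.arange(4), np.arange(4), indexing="ij"), axis=-1)
print(photo_consistency_loss(frame, frame, coords))  # → 0.0
```

A real pipeline would use differentiable bilinear sampling (so gradients flow back to the depth network) and often a robust photometric error, but the principle is the same: wrong depth or motion produces a misaligned warp and a larger loss.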

Related Material


[bibtex]
@InProceedings{Tan_2020_CVPR,
author = {Tan, Feitong and Zhu, Hao and Cui, Zhaopeng and Zhu, Siyu and Pollefeys, Marc and Tan, Ping},
title = {Self-Supervised Human Depth Estimation From Monocular Videos},
booktitle = {The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2020}
}