- [pdf] [supp]
Do What You Can, With What You Have: Scale-Aware and High Quality Monocular Depth Estimation Without Real World Labels
Learning robust and scale-aware monocular depth estimation (MDE) requires expensive data annotation efforts. Self-supervised approaches use unlabelled videos but, owing to the ambiguous photometric reprojection loss and the absence of labelled supervision, produce inferior-quality, relative (scale-ambiguous) depth maps with over-smoothed object boundaries. Approaches trained on synthetic data face a non-trivial domain adaptation problem; despite sophisticated unsupervised domain adaptation (UDA) techniques, these methods still generalize poorly to real datasets. This work presents a novel and effective training methodology that synergistically combines self-supervision from unlabelled monocular videos with dense supervision from a synthetic dataset, without complicated UDA techniques. With our method, geometry and semantics are learned from monocular videos, whereas scale-awareness and qualitative attributes crucial for practical use cases, e.g., sharp and smooth depth variations, are learned from the synthetic dataset. Our method outperforms self-supervised, semi-supervised, and all domain adaptation methods on standard benchmark datasets while remaining competitive with fully supervised methods. Furthermore, our method produces qualitatively superior depth maps, which increases its practical utility compared to existing methods. We demonstrate this by applying our method to develop an MDE model for a real-life application: a DSLR-like shallow depth-of-field effect on smartphones. The new high-quality synthetic depth dataset that we generate for this task will be made available to the community.
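The abstract does not give the training objective explicitly, but the described combination — a photometric reprojection loss on unlabelled real video plus dense supervised depth loss on synthetic ground truth — can be sketched as a simple mixed objective. All function names and the mixing weight `lam` below are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def photometric_loss(target_frame, reconstructed_frame):
    """Self-supervised term: L1 photometric reprojection error between a real
    video frame and its reconstruction warped from an adjacent frame.
    (Illustrative; real pipelines often add SSIM and masking.)"""
    return np.mean(np.abs(target_frame - reconstructed_frame))

def supervised_depth_loss(pred_depth, gt_depth):
    """Supervised term: dense L1 error against synthetic ground-truth depth.
    This is the scale-aware signal, since synthetic depth is metric."""
    return np.mean(np.abs(pred_depth - gt_depth))

def combined_loss(target_frame, reconstructed_frame, pred_depth, gt_depth, lam=0.5):
    """Hypothetical mixed objective: self-supervision on real video plus
    dense supervision on synthetic data, weighted by lam (an assumption)."""
    return (photometric_loss(target_frame, reconstructed_frame)
            + lam * supervised_depth_loss(pred_depth, gt_depth))

# Toy usage on random data
rng = np.random.default_rng(0)
frame = rng.random((8, 8, 3))
warped = frame + 0.01 * rng.standard_normal((8, 8, 3))
pred = rng.random((8, 8)) * 10.0
gt = pred + 0.1
loss = combined_loss(frame, warped, pred, gt)
```

The point of the sketch is only that the two supervision signals are summed into one objective, so the network learns geometry from real video and metric scale from synthetic labels simultaneously, rather than via a separate domain adaptation stage.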