Detail-Preserving Self-Supervised Monocular Depth With Self-Supervised Structural Sharpening
We propose to close the gap between self-supervised and fully-supervised methods for the single-view depth estimation (SVDE) task in terms of the level of detail and sharpness in the estimated depth maps. Detailed SVDE is challenging, as even fully-supervised methods struggle to obtain detail-preserving depth estimates. Recent works have proposed multi-scale boosting techniques and the exploitation of semantic masks to improve the structural information in the estimated depth maps. In contrast, the method proposed in this paper yields detail-preserving depth estimates from a single forward pass, without increasing the computational cost or requiring additional data. We achieve this by introducing a previously missing component in SVDE, Self-Supervised Structural Sharpening, referred to as S4. S4 is a mechanism that encourages a similar level of detail between the RGB input and the depth/disparity output. To this end, we propose a novel DispNet-S4 network for detail-preserving SVDE. Our network exploits un-blurring and un-noising tasks on clean input images to learn S4, without the need for either additional data (e.g., segmentation masks, matting maps, etc.) or advanced network blocks (attention, transformers, etc.). The structural details recovered by the un-blurring and un-noising operations are transferred to the estimated depth maps via adaptive convolutions, yielding structurally sharpened depths that are selectively used for self-supervision. We provide extensive experimental results and ablation studies showing that our proposed DispNet-S4 network yields fine details in the depth maps while achieving state-of-the-art quantitative metrics on the challenging KITTI dataset.
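To make the adaptive-convolution transfer mentioned above concrete, the following is a minimal sketch of spatially-adaptive filtering: per-pixel kernels (here assumed to be predicted from the un-blurring/un-noising guidance branch) are applied to the estimated depth map so that its edges follow the recovered image structure. The function name, shapes, and kernel source are hypothetical illustrations, not the paper's exact formulation.

```python
import numpy as np

def adaptive_filter(depth, kernels):
    """Apply per-pixel k x k kernels to a depth map.

    depth:   (H, W) estimated depth/disparity map.
    kernels: (H, W, k*k) per-pixel filter weights, assumed to be
             predicted from an RGB guidance branch (hypothetical).
    Returns a structurally sharpened (H, W) depth map.
    """
    H, W = depth.shape
    k = int(np.sqrt(kernels.shape[-1]))
    pad = k // 2
    # Edge-replicate padding so every pixel has a full k x k neighborhood.
    padded = np.pad(depth, pad, mode="edge")
    out = np.empty_like(depth)
    for y in range(H):
        for x in range(W):
            # Weighted sum of the local depth patch with that pixel's kernel.
            patch = padded[y:y + k, x:x + k].ravel()
            out[y, x] = patch @ kernels[y, x]
    return out
```

With identity kernels (all weight on the center tap) the depth map is returned unchanged; in practice the predicted kernels would redistribute weight toward pixels on the same side of an image edge, sharpening depth boundaries.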