How does the Machine Perceive Depth for Indoor Single Images with CNN?

Wu, Yihong; Heng, Yuwen; Niranjan, Mahesan; Kim, Hansung

Yihong Wu, Yuwen Heng, Mahesan Niranjan, Hansung Kim; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 2735-2744

Abstract

Depth estimation from a single image is a challenging problem in computer vision because binocular disparity and motion information are not available in the input. Although end-to-end trained deep neural architectures have recently reported impressive performance in this area, it is hard to know which cues in the images these black-box systems exploit. To this end, we quantify the relative contributions of the known depth cues in a single-image depth estimation setting using an indoor scene data set. Our work uses feature extraction techniques to isolate the individual features of shape, texture, colour and saturation and relate each, taken on its own, to depth prediction. We find that the shape of objects, extracted by edge detection, contributes substantially more than the other features in the indoor setting considered, while the other features also contribute to varying degrees. These insights will help optimise depth estimation models, boosting their accuracy and robustness, and promise to broaden the practical applications of vision-based depth estimation.
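
To make the idea of isolating single cues concrete, the following is a minimal sketch of how a shape cue (an edge map) and a saturation cue might be pulled out of an RGB image before being fed separately to a depth network. It is not the authors' pipeline: the abstract only states that shape is extracted by edge detection, so the choice of a Sobel filter, the HSV-style saturation formula, and all function names (`to_gray`, `sobel_edges`, `saturation`, `extract_cues`) are assumptions made for illustration.

```python
import numpy as np


def to_gray(rgb):
    """Luma-weighted grayscale of a float RGB image with shape (H, W, 3)."""
    return rgb @ np.array([0.299, 0.587, 0.114])


def sobel_edges(gray):
    """Gradient magnitude from 3x3 Sobel kernels (assumed edge detector)."""
    kx = np.array([[-1.0, 0.0, 1.0],
                   [-2.0, 0.0, 2.0],
                   [-1.0, 0.0, 1.0]])
    ky = kx.T
    h, w = gray.shape
    # Replicate border pixels so the output keeps the input size.
    padded = np.pad(gray, 1, mode="edge")
    gx = np.zeros((h, w))
    gy = np.zeros((h, w))
    # Sliding-window correlation written as nine shifted, weighted sums.
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)


def saturation(rgb):
    """HSV-style saturation: (max - min) / max, defined as 0 for black pixels."""
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    return np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-12), 0.0)


def extract_cues(rgb):
    """Return each illustrative cue as its own single-channel map."""
    return {
        "shape": sobel_edges(to_gray(rgb)),
        "saturation": saturation(rgb),
    }
```

Each map returned by the hypothetical `extract_cues` could then serve as the sole input to a depth estimator, so that the resulting error can be compared across cues. Texture and colour cues would need their own extractors, which the abstract does not specify.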

Related Material


@InProceedings{Wu_2025_CVPR,
    author    = {Wu, Yihong and Heng, Yuwen and Niranjan, Mahesan and Kim, Hansung},
    title     = {How does the Machine Perceive Depth for Indoor Single Images with CNN?},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {2735-2744}
}