Distilling Monocular Foundation Model for Fine-grained Depth Completion

Yingping Liang, Yutao Hu, Wenqi Shao, Ying Fu; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 22254-22265

Abstract


Depth completion predicts dense depth maps from sparse LiDAR inputs, a task critical to applications such as autonomous driving and robotics. However, the sparse depth annotations provided by sensors limit the availability of dense supervision, which is necessary for learning detailed geometric features. To overcome this limitation, we propose a two-stage knowledge distillation framework that leverages powerful monocular foundation models to provide dense supervision for depth completion. In the first stage, we introduce a pre-training strategy that generates diverse training data from natural images to distill geometric knowledge into depth completion models. Specifically, we simulate LiDAR scans using monocular depth estimation and mesh reconstruction, thereby creating training data without requiring ground-truth depth. Monocular depth estimation, however, suffers from inherent scale ambiguity in real-world settings. To address this, in the second stage we employ a scale- and shift-invariant loss (SSI Loss) to learn real-world scale when fine-tuning on real-world datasets. Our two-stage distillation framework enables depth completion models to harness the strengths of monocular foundation models. Experimental results show that models trained with our framework achieve top-ranked performance on the KITTI benchmark, with improvements in both quantitative metrics and qualitative results.
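
As a rough illustration of the second-stage objective, the sketch below shows one common way to implement a scale- and shift-invariant depth loss, following the widely used MiDaS-style least-squares alignment. The function name `ssi_loss`, the tensor shapes, and the L1 residual are assumptions for illustration; the paper's exact formulation may differ.

```python
import torch

def ssi_loss(pred, target, mask, eps=1e-6):
    """Scale- and shift-invariant (SSI) loss: a minimal sketch.

    For each sample, align the prediction to the target with a closed-form
    least-squares scale and shift over valid pixels, then average the
    absolute residual on the aligned prediction.

    pred, target: (B, H, W) depth maps; mask: (B, H, W) boolean validity mask.
    """
    losses = []
    for p, t, m in zip(pred, target, mask):
        p_v, t_v = p[m], t[m]                                 # valid pixels only
        # Solve min_{s, b} || s * p_v + b - t_v ||^2 in closed form.
        A = torch.stack([p_v, torch.ones_like(p_v)], dim=1)   # (N, 2) design matrix
        ATA = A.T @ A
        ATb = A.T @ t_v
        s, b = torch.linalg.solve(ATA + eps * torch.eye(2, device=p.device), ATb)
        losses.append((s * p_v + b - t_v).abs().mean())
    return torch.stack(losses).mean()
```

Because the alignment is solved per sample, the loss penalizes only the relative depth structure, which is what allows supervision from scale-ambiguous monocular predictions while the network learns metric scale from the sparse LiDAR input.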

Related Material


[bibtex]
@InProceedings{Liang_2025_CVPR,
    author    = {Liang, Yingping and Hu, Yutao and Shao, Wenqi and Fu, Ying},
    title     = {Distilling Monocular Foundation Model for Fine-grained Depth Completion},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {22254-22265}
}