CaBins: CLIP-based Adaptive Bins for Monocular Depth Estimation

Eunjin Son, Sang Jun Lee; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 4557-4567

Abstract


Traditional deep-learning models use knowledge pre-trained on large-scale datasets to fine-tune the model. This strategy significantly improves performance on downstream tasks such as object detection and segmentation. Recently, vision-language (VL) models that jointly train an image encoder and a text encoder have gained attention. Notably, CLIP, which employs contrastive learning for classification, contributed significantly to establishing the foundation for the VL model paradigm. In depth estimation, several CLIP-based models have been proposed that use images and texts called semantic bins. However, it is questionable whether these human-set semantic bins are reasonable. In this work, we propose a network for monocular depth estimation that leverages CLIP's pre-trained knowledge. Our model employs a regression-classification formulation, predicting depth through a linear combination of depth candidates and a probability map derived from the similarity score between the image embedding and the text embedding. Unlike previous works that rely on human-set semantic bins for the text embedding, our model converts the predicted depth candidates into distance classes using the CaBins module. Moreover, we modify CLIP's image encoder, which is designed for classification, to address the dense prediction task. Experiments were conducted on the NYU-Depth V2 and KITTI datasets. We compared the performance of our model with CLIP-based as well as unimodal monocular depth estimation models. Our proposed model outperformed previous CLIP-based models across all evaluation metrics and showed high-quality boundary predictions on both datasets. Our model is available at https://github.com/EunjinSon1/CaBins.
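The regression-classification formulation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes per-bin text embeddings and a single image embedding, and computes the final depth as the expectation of depth candidates under a softmax over CLIP-style cosine similarities (the function name, shapes, and temperature value are all assumptions for illustration).

```python
import numpy as np

def predict_depth(image_emb, text_embs, depth_candidates, temperature=0.07):
    """Hypothetical sketch of a regression-classification depth head:
    depth = sum_i p_i * c_i, where p_i comes from a softmax over
    image-text similarity scores and c_i are the depth candidates."""
    # L2-normalize embeddings so dot products are cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = text_embs @ image_emb            # (num_bins,) similarity scores
    # Softmax over similarities gives a probability per depth bin
    logits = sims / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Final depth: linear combination (expectation) of depth candidates
    return float(probs @ depth_candidates)

# Toy usage with random embeddings and 8 depth candidates
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(8, 512))       # one text embedding per bin
bins = np.linspace(0.5, 10.0, 8)            # depth candidates in meters
depth = predict_depth(image_emb, text_embs, bins)
```

Because the prediction is a convex combination of the candidates, it always lies within the range spanned by the depth bins; in the full model this is applied per pixel to produce a dense depth map.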

Related Material


[bibtex]
@InProceedings{Son_2024_CVPR,
  author    = {Son, Eunjin and Lee, Sang Jun},
  title     = {CaBins: CLIP-based Adaptive Bins for Monocular Depth Estimation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {4557-4567}
}