DepthBLIP-2: Leveraging Language to Guide BLIP-2 in Understanding Depth Information

Wei Chen, Changyong Shi, Chuanxiang Ma, Wenhao Li, Shulei Dong; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 2939-2953

Abstract


In recent years, vision-language models have made significant advances in computer vision and natural language processing. The BLIP-2 model effectively bridges the modality gap through its lightweight Q-Former, achieving excellent results at low training cost and pointing to promising directions for vision-language models. However, applying BLIP-2 to more complex tasks with quantized targets, such as monocular depth estimation, remains challenging. In this paper, we propose a method for monocular depth estimation built on BLIP-2. Our approach draws inspiration from DepthCLIP, which uses language to guide a model toward understanding depth information, and leverages the Q-Former module for modality fusion. In addition, we introduce adaptive depth bins to improve the model's robustness to distance quantization. We name our method DepthBLIP-2 and make our code publicly available at: https://github.com/especiallyW/DepthBLIP-2.
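To make the abstract's core idea concrete, the sketch below illustrates DepthCLIP-style language-guided depth regression with learnable (adaptive) depth bins: visual tokens (e.g., Q-Former outputs) are matched against text embeddings of semantic depth prompts, and the resulting similarity weights blend per-bin depth values. This is a minimal illustration under stated assumptions, not the authors' released implementation; the prompt wording, bin initialization, function names, and tensor shapes are all hypothetical.

```python
# Minimal sketch (not the authors' code) of language-guided depth with
# adaptive depth bins. Prompts, shapes, and bin setup are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical semantic depth prompts, one per depth bin.
DEPTH_PROMPTS = [
    "This object is extremely close.", "This object is close.",
    "This object is at a medium distance.", "This object is a little remote.",
    "This object is far.", "This object is very far.",
]

class AdaptiveDepthBins(nn.Module):
    """Learnable bin centers, initialized uniformly over [min_depth, max_depth]."""
    def __init__(self, num_bins: int, min_depth: float = 1e-3, max_depth: float = 10.0):
        super().__init__()
        # Adapted during training so the bins track the dataset's depth range.
        self.centers = nn.Parameter(torch.linspace(min_depth, max_depth, num_bins))

    def forward(self) -> torch.Tensor:
        return self.centers

def language_guided_depth(patch_feats: torch.Tensor,
                          text_feats: torch.Tensor,
                          bin_centers: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """
    patch_feats: (B, N, D) fused visual tokens, e.g. Q-Former outputs.
    text_feats:  (K, D) text embeddings of the K depth prompts.
    bin_centers: (K,) depth value assigned to each prompt/bin.
    Returns per-token depth (B, N) as a similarity-weighted sum of bin centers.
    """
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = patch_feats @ text_feats.t() / temperature  # (B, N, K) similarities
    weights = logits.softmax(dim=-1)                     # soft bin assignment
    return weights @ bin_centers                         # (B, N) depth estimates

# Toy usage with random features standing in for BLIP-2 / Q-Former outputs.
bins = AdaptiveDepthBins(num_bins=len(DEPTH_PROMPTS))
depth = language_guided_depth(torch.randn(2, 32, 256),
                              torch.randn(len(DEPTH_PROMPTS), 256),
                              bins())
print(depth.shape)  # torch.Size([2, 32])
```

In this reading, the text encoder turns each depth prompt into a classifier weight, and the adaptive bin centers replace fixed depth categories, which is one plausible way to interpret the "robustness to distance quantization" claim.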

Related Material


[pdf]
[bibtex]
@InProceedings{Chen_2024_ACCV,
    author    = {Chen, Wei and Shi, Changyong and Ma, Chuanxiang and Li, Wenhao and Dong, Shulei},
    title     = {DepthBLIP-2: Leveraging Language to Guide BLIP-2 in Understanding Depth Information},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {2939-2953}
}