DepthBLIP-2: Leveraging Language to Guide BLIP-2 in Understanding Depth Information
Abstract
In recent years, vision-language models have made significant advances in computer vision and natural language processing. The BLIP-2 model effectively bridges the modality gap through its lightweight Q-Former, achieving strong results at low training cost and pointing to promising directions for vision-language models. However, applying BLIP-2 to more complex quantitative tasks, such as monocular depth estimation, remains challenging. In this paper, we propose a method for monocular depth estimation based on BLIP-2. Our approach draws inspiration from DepthCLIP's use of language to guide a model toward understanding depth information, and it leverages the Q-Former module for modality fusion. In addition, we introduce adaptive depth bins to improve the model's robustness to the quantization of distances. We name our method DepthBLIP-2 and make our code publicly available at: https://github.com/especiallyW/DepthBLIP-2.
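To make the abstract's core idea concrete, the following is a minimal sketch of language-guided depth regression with adaptive depth bins, in the spirit of DepthCLIP and the DepthBLIP-2 description above: per-patch image features (e.g. Q-Former outputs) are compared against text embeddings of depth-describing prompts, and depth is predicted as a probability-weighted combination of learnable bin centres. All names, prompt counts, and hyperparameters here are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F


class LanguageGuidedDepthHead(nn.Module):
    """Maps per-patch image features to depth via text-described depth bins."""

    def __init__(self, text_embeds: torch.Tensor, min_depth: float = 1e-3, max_depth: float = 10.0):
        super().__init__()
        # text_embeds: (num_bins, dim) -- frozen embeddings of depth prompts such as
        # "This object is giant / extremely close / close / ... / far / unseen".
        self.register_buffer("text_embeds", F.normalize(text_embeds, dim=-1))
        num_bins = text_embeds.shape[0]
        # Adaptive depth bins: learnable centres, squashed into [min_depth, max_depth]
        # with a sigmoid at forward time instead of being fixed by hand.
        self.bin_logits = nn.Parameter(torch.linspace(-2.0, 2.0, num_bins))
        self.min_depth, self.max_depth = min_depth, max_depth
        self.logit_scale = nn.Parameter(torch.tensor(10.0))

    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (batch, num_patches, dim), projected into the text embedding space.
        patch_feats = F.normalize(patch_feats, dim=-1)
        # Similarity between every patch and every depth-describing prompt.
        logits = self.logit_scale * patch_feats @ self.text_embeds.t()   # (B, P, K)
        probs = logits.softmax(dim=-1)
        # Adaptive bin centres constrained to the valid depth range.
        centres = self.min_depth + (self.max_depth - self.min_depth) * torch.sigmoid(self.bin_logits)
        # Depth per patch = probability-weighted combination of bin centres.
        return probs @ centres                                           # (B, P)


if __name__ == "__main__":
    # Toy usage with random tensors standing in for BLIP-2 text / image features.
    text = torch.randn(7, 256)          # 7 depth prompts, 256-d embeddings
    head = LanguageGuidedDepthHead(text)
    feats = torch.randn(2, 196, 256)    # 2 images, 14x14 patches
    depth = head(feats)
    print(depth.shape)                  # torch.Size([2, 196])

Making the bin centres learnable (rather than fixed, as in DepthCLIP's hand-picked depth values) is one plausible reading of the "adaptive depth bin" mentioned in the abstract; the released repository should be consulted for the exact formulation.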
Related Material
[pdf] [bibtex]
@InProceedings{Chen_2024_ACCV,
  author    = {Chen, Wei and Shi, Changyong and Ma, Chuanxiang and Li, Wenhao and Dong, Shulei},
  title     = {DepthBLIP-2: Leveraging Language to Guide BLIP-2 in Understanding Depth Information},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
  month     = {December},
  year      = {2024},
  pages     = {2939-2953}
}