Learning to Prompt CLIP for Monocular Depth Estimation: Exploring the Limits of Human Language

Dylan Auty, Krystian Mikolajczyk; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 2039-2047

Abstract


CLIP is a significant vision-and-language training framework that has shown surprisingly general understanding of the world, with good performance in many open-ended tasks with little or no additional training. A recent technique has used CLIP to perform 0-shot Monocular Depth Estimation (MDE) by using depth-related prompts, but the use of human language in these prompts presents an unnecessary human bias. In this work, we use continuous learnable tokens in place of discrete human-language words to shed light on the problem. We achieve a significant boost in performance, and find that the learned tokens do not map neatly to depth-related human language, implying that CLIP's concept of depth is not succinctly expressible in human language. We posit that this may extend to other CLIP concepts, and believe that this finding will spark further research into both the use and interpretation of non-linguistic tokens in all open-ended scene interpretation tasks. Code is available at https://github.com/DylanAuty/PromptLearningCLIP-MDE
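The sketch below (not the authors' released code) illustrates the core idea described in the abstract: in a DepthCLIP-style setup, the discrete depth words (e.g. "close", "far") in the text prompts are replaced by continuous learnable token embeddings, and per-patch depth is read out as a similarity-weighted combination of per-bin depth values. The module names, dimensions, bin depths, and the stand-in text encoder are all illustrative assumptions; a real implementation would reuse CLIP's frozen token embedding and text transformer.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnablePromptDepth(nn.Module):
    def __init__(self, embed_dim=512, n_ctx=4, n_bins=7,
                 bin_depths=(1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)):
        super().__init__()
        # One set of n_ctx continuous tokens per depth bin, learned end-to-end
        # in place of discrete human-language words such as "close" or "far".
        self.ctx = nn.Parameter(torch.randn(n_bins, n_ctx, embed_dim) * 0.02)
        # Stand-in for CLIP's (frozen) text transformer: maps each bin's token
        # sequence to a single text embedding. Hypothetical, for illustration only.
        self.text_encoder = nn.Sequential(nn.Flatten(1),
                                          nn.Linear(n_ctx * embed_dim, embed_dim))
        # Depth value assigned to each bin; the prediction is a
        # similarity-weighted combination of these values.
        self.register_buffer("bin_depths", torch.tensor(bin_depths))
        self.logit_scale = nn.Parameter(torch.tensor(14.3).log())

    def forward(self, patch_features):
        # patch_features: (B, N_patches, embed_dim) features from CLIP's visual
        # encoder (assumed precomputed upstream).
        text_feat = F.normalize(self.text_encoder(self.ctx), dim=-1)   # (n_bins, D)
        patch_feat = F.normalize(patch_features, dim=-1)                # (B, N, D)
        logits = self.logit_scale.exp() * patch_feat @ text_feat.t()    # (B, N, n_bins)
        probs = logits.softmax(dim=-1)
        # Soft classification over depth bins -> expected depth per patch.
        return probs @ self.bin_depths                                  # (B, N)

# Usage with dummy features standing in for CLIP ViT patch embeddings.
model = LearnablePromptDepth()
depth = model(torch.randn(2, 196, 512))
print(depth.shape)  # torch.Size([2, 196])

Only the learnable tokens (and optionally the temperature) would be optimised against a depth loss, keeping the CLIP encoders frozen, which matches the prompt-learning framing of the paper.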

Related Material


@InProceedings{Auty_2023_ICCV,
    author    = {Auty, Dylan and Mikolajczyk, Krystian},
    title     = {Learning to Prompt CLIP for Monocular Depth Estimation: Exploring the Limits of Human Language},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2023},
    pages     = {2039-2047}
}