@InProceedings{Roy_2025_ICCV,
  author    = {Roy, Rajarshi and Das, Devleena and Banerjee, Ankesh and Bhattacharjee, Arjya and Dasgupta, Kousik and Tripathi, Subarna},
  title     = {ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {6117-6123}
}
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way
Abstract
We introduce ByDeWay, a training-free framework that boosts the performance of Multimodal Large Language Models (MLLMs). Specifically, ByDeWay leverages a novel prompting strategy, Layered-Depth-Based Prompting (LDP), which enhances the spatial reasoning and grounding capabilities of MLLMs. Our key insight is to inject structured spatial context derived from monocular depth estimation into the input prompts, without modifying any model parameters. By segmenting scenes into closest, mid-range, and farthest depth layers and generating region-specific captions with a grounded vision-language model, we produce explicit depth-aware textual descriptions. These descriptions are concatenated with image-question prompts to guide the model toward spatially grounded, hallucination-resistant outputs. Our method is lightweight, modular, and compatible with any black-box MLLM. Evaluations on hallucination-sensitive (POPE) and reasoning-intensive (GQA) tasks show consistent improvements across multiple MLLMs, demonstrating the effectiveness of depth-aware prompting in a zero-training setup.
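The pipeline sketched in the abstract can be illustrated in a few lines. The sketch below is an assumption-laden toy version, not the authors' implementation: the depth map is synthetic, the percentile thresholds and the prompt template are hypothetical choices, and the grounded vision-language captioner is stubbed out with placeholder strings.

```python
import numpy as np

def depth_layers(depth, low_pct=33, high_pct=66):
    """Partition a monocular depth map into closest / mid-range / farthest masks.

    Assumes smaller depth values mean closer to the camera; the 33/66
    percentile cut-offs are illustrative, not from the paper.
    """
    lo, hi = np.percentile(depth, [low_pct, high_pct])
    return {
        "closest": depth <= lo,
        "mid-range": (depth > lo) & (depth <= hi),
        "farthest": depth > hi,
    }

def build_ldp_prompt(layer_captions, question):
    """Concatenate per-layer captions with the question.

    The exact prompt template used by ByDeWay is not specified here;
    this wording is a hypothetical stand-in.
    """
    context = " ".join(
        f"In the {layer} layer: {caption}."
        for layer, caption in layer_captions.items()
    )
    return f"Spatial context: {context}\nQuestion: {question}"

# Synthetic 10x10 depth map; in the real pipeline this would come from
# a monocular depth estimator run on the input image.
depth = np.linspace(0.5, 10.0, 100).reshape(10, 10)
masks = depth_layers(depth)

# Placeholder captions; in the real pipeline a grounded vision-language
# model would describe the image regions selected by each mask.
captions = {name: f"region covering {int(mask.sum())} pixels"
            for name, mask in masks.items()}

prompt = build_ldp_prompt(captions, "What object is closest to the camera?")
print(prompt)
```

Because the layers form a partition of the depth map, every pixel is described exactly once, and the resulting prompt gives the black-box MLLM explicit near-to-far structure without touching its weights.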