Draft and Refine with Visual Experts

Jeong, Sungheon; Masukawa, Ryozo; Park, Jihong; Yun, Sanggeon; Huang, Wenjun; Chen, Hanning; Imani, Mahdi; Imani, Mohsen

Sungheon Jeong, Ryozo Masukawa, Jihong Park, Sanggeon Yun, Wenjun Huang, Hanning Chen, Mahdi Imani, Mohsen Imani; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 18816-18826

Abstract

While recent Large Vision-Language Models (LVLMs) exhibit impressive multimodal reasoning abilities, they often produce ungrounded, hallucinated responses by over-relying on linguistic priors rather than visual evidence. This critical limitation arises from the lack of a quantitative measure of how much these models actually rely on visual inputs during reasoning. We propose Draft and Refine (DnR), an agent framework driven by a novel question-conditioned utilization metric. This metric quantifies the model's actual reliance on visual evidence by first constructing a query-conditioned relevance map to localize question-specific evidence, and then assessing dependence through relevance-based probabilistic masking. Guided by this metric, the DnR agent refines its initial "draft" through targeted feedback from external visual experts. Each expert's output (e.g., boxes, masks) is rendered as visual cues on the image, and the LVLM is re-queried to select the response that yields the greatest improvement in utilization. This process strengthens visual grounding of predictions without retraining or architectural changes. Experiments across a broad range of VQA and captioning benchmarks demonstrate consistent accuracy gains and reduced hallucination. These results show that quantifying visual utilization provides a principled path for designing more interpretable and evidence-driven multimodal agent systems that effectively leverage visual experts.
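The loop described in the abstract (score reliance on visual evidence by masking question-relevant regions, then keep the expert-annotated input that most increases that score) can be illustrated with a toy sketch. The abstract gives no formulas, so everything here is an assumption made for illustration: the image is a short list of patch values, `toy_score` stands in for the LVLM's confidence in its answer, the relevance map is supplied by hand, and utilization is taken to be the average confidence drop under relevance-weighted random masking. None of these names or definitions come from the paper.

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical toy types: an "image" is a list of patch values, and a scorer
# returns the model's confidence in its current answer for that image.
Image = List[float]
Scorer = Callable[[Image], float]


def masked_copy(image: Image, relevance: List[float], rng: random.Random) -> Image:
    """Zero out each patch with probability equal to its relevance in [0, 1]."""
    return [0.0 if rng.random() < r else v for v, r in zip(image, relevance)]


def utilization(image: Image, relevance: List[float], score: Scorer,
                n_samples: int = 200, seed: int = 0) -> float:
    """Assumed metric: mean confidence drop when relevant patches are masked."""
    rng = random.Random(seed)
    base = score(image)
    drops = [base - score(masked_copy(image, relevance, rng))
             for _ in range(n_samples)]
    return sum(drops) / n_samples


def refine(draft: Image, relevance: List[float], score: Scorer,
           candidates: Dict[str, Image]) -> Tuple[str, float]:
    """Return the input with the highest utilization and its gain over the draft."""
    base_u = utilization(draft, relevance, score)
    best_name, best_u = "draft", base_u
    for name, img in candidates.items():
        u = utilization(img, relevance, score)
        if u > best_u:
            best_name, best_u = name, u
    return best_name, best_u - base_u


def toy_score(image: Image) -> float:
    """Stand-in for an LVLM: confidence depends only on the first two patches."""
    return (image[0] + image[1]) / 2


# Patches 0 and 1 are marked as question-relevant; the hypothetical "boxes"
# cue makes the evidence in those patches more salient to the toy model.
relevance_map = [0.9, 0.9, 0.1, 0.1]
draft_image = [0.2, 0.2, 1.0, 1.0]
expert_images = {"boxes": [0.8, 0.8, 1.0, 1.0], "masks": [0.1, 0.1, 1.0, 1.0]}

choice, gain = refine(draft_image, relevance_map, toy_score, expert_images)
print(choice, round(gain, 3))
```

In this toy setting the selection step keeps the draft whenever no expert cue raises the score, which matches the abstract's description of choosing the response with the greatest improvement in utilization; how the actual framework computes relevance maps and masking probabilities is not specified on this page.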

Related Material


@InProceedings{Jeong_2026_CVPR,
    author    = {Jeong, Sungheon and Masukawa, Ryozo and Park, Jihong and Yun, Sanggeon and Huang, Wenjun and Chen, Hanning and Imani, Mahdi and Imani, Mohsen},
    title     = {Draft and Refine with Visual Experts},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {18816-18826}
}