Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?

Yuan-Hong Liao, Rafid Mahmood, Sanja Fidler, David Acuna; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 14667-14678

Abstract


Improving semantic grounding in Vision-Language Models (VLMs) often involves collecting domain-specific training data, refining the network architecture, or modifying the training recipe. In this work, we venture in an orthogonal direction and explore self-correction in VLMs, focusing on semantic grounding. We find that VLMs can correct their own semantic grounding mistakes when properly prompted and framed for the task, without any fine-tuning or even access to oracle feedback. Building on this, we introduce an iterative self-correction framework that consistently improves semantic grounding across all models investigated, by up to 8.4 accuracy points, without requiring fine-tuning, architectural changes, or external data. Our exploration also reveals that, even after several rounds of feedback, strong models such as GPT-4V and GPT-4o remain limited in their ability to leverage oracle feedback, suggesting promising directions for further research.
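The abstract describes a prompt-based, iterative self-correction loop. As a rough illustration of what such a loop could look like in practice (this sketch is not taken from the paper; the `query_vlm` helper, the prompt wording, and the fixed round budget are all assumptions made for illustration), one might structure it as follows:

```python
# Hypothetical sketch of an iterative self-correction loop for semantic
# grounding. `query_vlm` stands in for any chat-style VLM API call; the
# prompts and the round budget are illustrative assumptions, not the
# paper's actual framework.

def query_vlm(image, prompt: str) -> str:
    """Placeholder for a call to a vision-language model (e.g., GPT-4V)."""
    raise NotImplementedError

def self_correct_grounding(image, question: str, num_rounds: int = 3) -> str:
    # Initial grounding attempt, e.g., "Which region contains the red mug?"
    answer = query_vlm(image, f"{question}\nAnswer with the region label only.")

    for _ in range(num_rounds):
        # Ask the same model to critique its own answer (intrinsic feedback,
        # no oracle involved).
        critique = query_vlm(
            image,
            f"Question: {question}\nProposed answer: {answer}\n"
            "Carefully verify whether the proposed answer is correct. "
            "Reply 'CORRECT' or explain the mistake.",
        )
        if critique.strip().upper().startswith("CORRECT"):
            break  # the model is satisfied with its answer; stop early

        # Revise the answer conditioned on the self-generated feedback.
        answer = query_vlm(
            image,
            f"Question: {question}\nPrevious answer: {answer}\n"
            f"Feedback: {critique}\n"
            "Give a corrected answer with the region label only.",
        )
    return answer
```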

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Liao_2025_CVPR,
    author    = {Liao, Yuan-Hong and Mahmood, Rafid and Fidler, Sanja and Acuna, David},
    title     = {Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {14667-14678}
}