LLaVA-SCo: Teach Vision Language Models to Self-Correct
Abstract
Large Language Models (LLMs) have demonstrated remarkable progress in self-correction, as seen in models like DeepSeek-R1. However, current Vision-Language Models (VLMs) often struggle to self-correct in complex question-answering tasks, and reinforcement learning (RL)-based approaches, while effective, incur substantial computational costs. In this work, we propose LLaVA-SCo, a novel VLM designed for efficient self-correction without relying on RL. Instead, we introduce a self-correction stage after the sequential reasoning steps of LLaVA-CoT, refining model outputs through a two-turn self-correction mechanism trained with supervised fine-tuning. To support this, we construct a large-scale dataset enriched with refined reasoning annotations to strengthen correction capabilities. Experiments on multimodal reasoning benchmarks show that LLaVA-SCo outperforms its base model by 2.7%, demonstrating significant improvements in reasoning ability. Additionally, GPT-4o-based evaluations indicate that self-corrected responses are more clearly structured, easier to comprehend, and preferred by GPT-4o. Finally, self-correction performance metrics confirm that LLaVA-SCo effectively refines its reasoning, achieving consistent accuracy gains while minimizing reversal errors, and thus systematically improves its responses through self-correction.
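
The two-turn mechanism summarized above can be read as a simple inference-time protocol: the model first produces its step-by-step answer, then is prompted to review and refine that answer in a second turn. Below is a minimal sketch of this flow; the prompt wording, the generate_fn interface, and the self_correct helper are illustrative assumptions, not the prompts or training setup used by LLaVA-SCo (which learns the correction turn via supervised fine-tuning on refined reasoning annotations).

# Minimal sketch of a two-turn self-correction loop for a vision-language model.
# `generate_fn(image, prompt) -> str` is an assumed, caller-supplied interface
# (e.g., a wrapper around any LLaVA-style model); the prompts are illustrative,
# not the ones used to train LLaVA-SCo.
from typing import Callable, Dict


def self_correct(generate_fn: Callable[[object, str], str],
                 image: object,
                 question: str) -> Dict[str, str]:
    # Turn 1: sequential reasoning followed by an initial answer (LLaVA-CoT style).
    first_prompt = (
        f"Question: {question}\n"
        "Think through the problem step by step, then state your answer."
    )
    initial_answer = generate_fn(image, first_prompt)

    # Turn 2: the model re-reads its own output and is asked to refine it.
    correction_prompt = (
        f"Question: {question}\n"
        f"Your previous answer was:\n{initial_answer}\n"
        "Review the reasoning above, fix any mistakes, and give a corrected final answer."
    )
    refined_answer = generate_fn(image, correction_prompt)

    return {"initial": initial_answer, "refined": refined_answer}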
Related Material
[pdf] [supp] [bibtex]
@InProceedings{Liu_2025_CVPR,
    author    = {Liu, Zixuan and Jiang, Guangkai and Khajavi, Siavash},
    title     = {LLaVA-SCo: Teach Vision Language Models to Self-Correct},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {3445-3454}
}