CobraVPS: Code Template Optimization for Better Question Reasoning Accuracy with Visual Program Synthesis

Jiajing Chen, Xiu Zhang, Yang Li, Renyu Zhang, Yujie Dong, Senem Velipasalar, Jing Zhang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 4390-4399

Abstract


End-to-end Visual Question Answering (VQA) models take an image and a question as input and directly output an answer. Recently, visual program synthesis has introduced a new approach to VQA tasks, in which a large language model generates a piece of code from the input question, and that code is then executed on the input image to produce the answer. Compared to end-to-end models, visual program synthesis offers better explainability and flexibility. While existing visual program synthesis methods focus on generating code with correct logic, in this work we first show that logically correct code does not always produce the right answer to the input question. Based on this insight, we propose CobraVPS (Code Template Optimization for Better Question Reasoning Accuracy with Visual Program Synthesis) for VQA tasks. Experiments conducted on three datasets, namely GQA, VQAv2, and Winoground, show that the proposed CobraVPS outperforms the state-of-the-art baseline by up to 5.4% in accuracy. CobraVPS does not require human annotation or model fine-tuning, and demonstrates stable performance across different Code LLMs.
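The generate-then-execute pipeline described in the abstract can be illustrated with a toy sketch. This is not the CobraVPS method or its code templates. Every name here (`find`, `generate_program`, `answer`) is hypothetical, the "image" is a hand-written list of detected objects, and the "code LLM" is a hard-coded lookup that stands in for a real model.

```python
from typing import Dict, List

# Hypothetical stand-in for an image: a list of already-detected objects.
Image = List[Dict[str, str]]


def find(image: Image, name: str) -> List[Dict[str, str]]:
    """Toy 'vision module': return the detected objects with the given name."""
    return [obj for obj in image if obj["name"] == name]


def generate_program(question: str) -> str:
    """Stand-in for the code LLM: returns a hard-coded program per question."""
    programs = {
        "What color is the car?": (
            "def execute(image):\n"
            "    cars = find(image, 'car')\n"
            "    return cars[0]['color'] if cars else 'unknown'\n"
        ),
    }
    return programs[question]


def answer(image: Image, question: str) -> str:
    """Generate a program for the question, then run it on the image."""
    source = generate_program(question)
    namespace = {"find": find}  # expose the toy vision module to the program
    exec(source, namespace)     # defines execute() inside namespace
    return namespace["execute"](image)


image = [{"name": "car", "color": "red"}, {"name": "tree", "color": "green"}]
result = answer(image, "What color is the car?")
print(result)
```

Because the answer comes from running an inspectable program, each intermediate step (here, the `find` call) can be examined, which is the explainability benefit the abstract refers to.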

Related Material


@InProceedings{Chen_2025_ICCV,
    author    = {Chen, Jiajing and Zhang, Xiu and Li, Yang and Zhang, Renyu and Dong, Yujie and Velipasalar, Senem and Zhang, Jing},
    title     = {CobraVPS: Code Template Optimization for Better Question Reasoning Accuracy with Visual Program Synthesis},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {4390-4399}
}