Synthesizing Visual Concepts as Vision-Language Programs

Wüst, Antonia; Stammer, Wolfgang; Shindo, Hikaru; Helff, Lukas; Dhami, Devendra Singh; Kersting, Kristian

Antonia Wüst, Wolfgang Stammer, Hikaru Shindo, Lukas Helff, Devendra Singh Dhami, Kristian Kersting; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 17346-17356

Abstract

Vision-language models (VLMs) achieve strong performance on multimodal tasks but often fail at systematic visual reasoning, especially in inductive reasoning problems. Neuro-symbolic methods promise to address this by inducing interpretable logical programs from images, but they usually rely on rigid, domain-specific perception modules. We propose Vision-Language Programs (VLP), which combine the perceptual flexibility of VLMs with the systematic reasoning of symbolic program synthesis. Rather than embedding reasoning inside the VLM, VLP uses the model to produce structured visual descriptions that are compiled into neuro-symbolic programs. The resulting programs execute directly on images, remain consistent with task constraints, and provide human-interpretable explanations that make shortcut mitigation easy. Our experiments across synthetic and real-world datasets demonstrate that VLP outperforms both direct and structured prompting of VLMs, particularly on tasks that require complex logical reasoning.

Related Material

@InProceedings{Wust_2026_CVPR,
    author    = {W\"ust, Antonia and Stammer, Wolfgang and Shindo, Hikaru and Helff, Lukas and Dhami, Devendra Singh and Kersting, Kristian},
    title     = {Synthesizing Visual Concepts as Vision-Language Programs},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {17346-17356}
}