@InProceedings{Mishra_2025_ICCV,
  author    = {Mishra, Samarth and Saenko, Kate and Saligrama, Venkatesh},
  title     = {SCRAMBLe: Enhancing Multimodal LLM Compositionality with Synthetic Preference Data},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {6292-6302}
}
SCRAMBLe: Enhancing Multimodal LLM Compositionality with Synthetic Preference Data
Abstract
Compositionality, or correctly recognizing scenes as compositions of atomic visual concepts, remains difficult for multimodal large language models (MLLMs). Even state-of-the-art MLLMs such as GPT-4o can fail to distinguish compositions like "dog chasing cat" from "cat chasing dog". Although MLLMs have made significant progress on Winoground, a benchmark for measuring such reasoning, they still fall far short of human performance. We show that compositional reasoning in these models can be improved by elucidating such concepts via data, training a model to prefer the correct caption for an image over a close but incorrect one. We introduce SCRAMBLe: Synthetic Compositional Reasoning Augmentation of MLLMs with Binary preference Learning, an approach for preference tuning open-weight MLLMs on synthetic preference data generated fully automatically from existing image-caption data. SCRAMBLe holistically improves these MLLMs' compositional reasoning capabilities, yielding significant gains across multiple vision-language compositionality benchmarks as well as smaller but significant improvements on general question answering tasks. As a highlight, the SCRAMBLe-tuned Molmo-7B model improves on Winoground from 49.5% to 54.8% (the best reported to date), while also improving by about 1% on more general visual question answering tasks. The synthetic dataset, SCRAMBLe-tuned MLLMs, and the code for training and evaluation will be made publicly available.
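The abstract describes training a model to prefer the correct caption over a close but incorrect one from binary preference pairs. A common instantiation of such binary preference tuning is a DPO-style objective; the sketch below illustrates that loss for a single (correct, perturbed) caption pair. The function name and the use of a reference model are illustrative assumptions, not details taken from the paper.

```python
import math

def binary_preference_loss(logp_chosen: float, logp_rejected: float,
                           ref_logp_chosen: float, ref_logp_rejected: float,
                           beta: float = 0.1) -> float:
    """DPO-style binary preference loss (illustrative sketch, not the paper's
    exact objective). logp_* are the policy's caption log-likelihoods given the
    image; ref_logp_* are the frozen reference model's. Minimizing this pushes
    the policy to prefer the correct caption over the perturbed one."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the scaled preference margin.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy assigns the correct caption a higher likelihood (relative to the reference) than the incorrect one, the margin is positive and the loss is low; swapping the two captions raises the loss.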