Learning Visual Composition through Improved Semantic Guidance
Abstract
Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building better representations for a small number of discrete objects bereft of an understanding of how these objects are interacting. One can observe this limitation in representations learned through captions or contrastive learning - where the learned model treats an image essentially as a bag of words. Several works have attempted to address this limitation through the development of bespoke architectures. In this work, we focus on simple and scalable approaches. In particular, we demonstrate that by improving weakly labeled data, i.e. captions, we can vastly improve the performance of standard contrastive learning approaches. Previous CLIP models achieved near chance rate on challenging tasks probing compositional learning. However, our simple approach boosts performance of CLIP substantially and achieves state of the art results on compositional benchmarks such as ARO and SugarCrepe. Furthermore, we showcase our results on a relatively new captioning benchmark derived from DOCCI. We demonstrate through a series of ablations that a standard CLIP model trained with enhanced data may demonstrate impressive performance on image retrieval tasks.
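For context on the "standard contrastive learning approaches" the abstract refers to, the sketch below shows the symmetric contrastive (InfoNCE) objective used by CLIP-style models over a batch of paired image and caption embeddings. It is a minimal illustration, not the authors' training pipeline or recaptioning method; the function name, the temperature value, and the random toy embeddings are assumptions for demonstration only.

import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss of the kind used in standard CLIP training.

    image_emb, text_emb: (batch, dim) arrays of paired embeddings, where
    row i of each array comes from the same (image, caption) pair.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity logits, scaled by temperature; matching pairs lie on the diagonal.
    logits = image_emb @ text_emb.T / temperature
    labels = np.arange(logits.shape[0])

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy usage with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 512)).astype(np.float32)
txt = rng.normal(size=(8, 512)).astype(np.float32)
print(clip_contrastive_loss(img, txt))

The paper's claim is that this objective, left unchanged, performs far better on compositional benchmarks when the caption side of each pair is improved, rather than requiring a bespoke architecture.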
Related Material

[pdf] [supp] [arXiv]

BibTeX:
@InProceedings{Stone_2025_CVPR,
  author    = {Stone, Austin and Soltau, Hagen and Geirhos, Robert and Yi, Xi and Xia, Ye and Cao, Bingyi and Chen, Kaifeng and Ogale, Abhijit and Shlens, Jonathon},
  title     = {Learning Visual Composition through Improved Semantic Guidance},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {3740-3750}
}