Learning Visual Composition through Improved Semantic Guidance
Abstract
Visual imagery does not consist of solitary objects, but instead reflects the composition of a multitude of fluid concepts. While there have been great advances in visual representation learning, such advances have focused on building better representations for a small number of discrete objects bereft of an understanding of how these objects are interacting. One can observe this limitation in representations learned through captions or contrastive learning - where the learned model treats an image essentially as a bag of words. Several works have attempted to address this limitation through the development of bespoke architectures. In this work, we focus on simple and scalable approaches. In particular, we demonstrate that by improving weakly labeled data, i.e. captions, we can vastly improve the performance of standard contrastive learning approaches. Previous CLIP models achieved near chance rate on challenging tasks probing compositional learning. However, our simple approach boosts performance of CLIP substantially and achieves state of the art results on compositional benchmarks such as ARO and SugarCrepe. Furthermore, we showcase our results on a relatively new captioning benchmark derived from DOCCI. We demonstrate through a series of ablations that a standard CLIP model trained with enhanced data may demonstrate impressive performance on image retrieval tasks.
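For context on the "standard contrastive learning approaches" the abstract refers to, the sketch below shows the symmetric contrastive (InfoNCE) objective used by CLIP-style models over a batch of paired image and caption embeddings. It is a minimal illustration, not the authors' training pipeline or recaptioning method; the function name, the temperature value, and the random toy embeddings are assumptions for demonstration only.

import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss of the kind used in standard CLIP training.

    image_emb, text_emb: (batch, dim) arrays of paired embeddings, where
    row i of each array comes from the same (image, caption) pair.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity logits, scaled by temperature; matching pairs lie on the diagonal.
    logits = image_emb @ text_emb.T / temperature
    labels = np.arange(logits.shape[0])

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))

# Toy usage with random embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 512)).astype(np.float32)
txt = rng.normal(size=(8, 512)).astype(np.float32)
print(clip_contrastive_loss(img, txt))

The paper's claim is that this objective, left unchanged, performs far better on compositional benchmarks when the caption side of each pair is improved, rather than requiring a bespoke architecture.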
Related Material

[pdf] [supp] [arXiv]

BibTeX:
@InProceedings{Stone_2025_CVPR,
  author    = {Stone, Austin and Soltau, Hagen and Geirhos, Robert and Yi, Xi and Xia, Ye and Cao, Bingyi and Chen, Kaifeng and Ogale, Abhijit and Shlens, Jonathon},
  title     = {Learning Visual Composition through Improved Semantic Guidance},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {3740-3750}
}