Contrastive Learning of Image Representations Guided by Spatial Relations

Servant, Logan; Clément, Michaël; Wendling, Laurent; Kurtz, Camille

Logan Servant, Michaël Clément, Laurent Wendling, Camille Kurtz; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 2124-2133

Abstract

The spatial information contained in images is of critical importance for many computer vision tasks. Current state-of-the-art approaches dealing with spatially-related tasks, such as spatial relationship recognition, are typically trained in a supervised manner with semantic information carried by annotations. However, datasets containing spatial relations (such as VisualGenome and SpatialSense) contain many errors or ambiguities at the label level (e.g., polysemy of the relations, use of different reference frames across relations), which might deteriorate the representation learning step. The representations and image embeddings obtained from this training setup carry poor spatial information, as they are entangled with other modalities such as semantic information. To deal with this issue, we introduce C-SIP (Contrastive Spatial-Image Pre-training), an approach aiming to learn better spatially-aware image representations that are more in agreement with human perception of a scene, in which spatial information plays a structuring role. This training strategy focuses on the alignment of the embeddings of an image encoder and a spatial encoder, optimized from the image content in a self-supervised manner. We showcase that training a model with spatial information at its core, thanks to C-SIP, allows for better spatially-aware image representations on three downstream tasks. These representations can be used in a zero-shot setting, such as image retrieval, or fine-tuned on semantic tasks, such as visual question answering, and provide better results compared with supervised counterparts. Our source code can be found at https://github.com/Logan-wilson/CSIP.
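
The abstract does not state which loss is used to align the two encoders. As a rough illustration of what "alignment of the embeddings of an image encoder and a spatial encoder" can look like, the sketch below assumes a CLIP-style symmetric contrastive (InfoNCE) objective over a batch of paired embeddings. The function names, the temperature value of 0.07, and the toy vectors are all hypothetical and are not taken from the paper or its repository.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (assumes v is not the zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_emb, spatial_emb, temperature=0.07):
    """Assumed CLIP-style loss: pair i of each batch is the positive match.

    image_emb and spatial_emb are equally sized lists of vectors, standing in
    for the outputs of an image encoder and a spatial encoder.
    """
    img = [l2_normalize(v) for v in image_emb]
    spa = [l2_normalize(v) for v in spatial_emb]
    # Cosine similarity between every image and every spatial embedding.
    logits = [
        [sum(a * b for a, b in zip(u, w)) / temperature for w in spa]
        for u in img
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-spatial and spatial-to-image directions.
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


# Toy batch: matched pairs point in the same direction, swapped pairs do not.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
```

Under these assumptions, the loss is close to zero when each image embedding points in the same direction as its paired spatial embedding, and grows when the pairs are mismatched, which is the behavior a contrastive alignment objective is meant to reward.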

Related Material

[bibtex]

@InProceedings{Servant_2025_WACV,
    author    = {Servant, Logan and Cl\'ement, Micha\"el and Wendling, Laurent and Kurtz, Camille},
    title     = {Contrastive Learning of Image Representations Guided by Spatial Relations},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {2124-2133}
}