UniGeoCLIP: Unified Geospatial Contrastive Learning

Astruc, Guillaume; Trulls, Eduard; Hosang, Jan; Landrieu, Loic; Sarlin, Paul-Edouard

Guillaume Astruc, Eduard Trulls, Jan Hosang, Loic Landrieu, Paul-Edouard Sarlin; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 7847-7856

Abstract

The growing availability of co-located geospatial data spanning aerial imagery, street-level views, elevation models, text, and geographic coordinates offers a unique opportunity for multimodal representation learning. We introduce UniGeoCLIP, a massively multimodal contrastive framework that jointly aligns five complementary geospatial modalities in a single unified embedding space. Unlike prior approaches that fuse modalities or rely on a central pivot representation, our method performs all-to-all contrastive alignment, enabling seamless comparison, retrieval, and reasoning across arbitrary combinations of modalities. We further propose a scaled latitude-longitude encoder that improves spatial representation by capturing multi-scale geographic structure. Extensive experiments across downstream geospatial tasks demonstrate that UniGeoCLIP consistently outperforms single-modality contrastive models and coordinate-only baselines, highlighting the benefits of holistic multimodal geospatial alignment.
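As an illustration of what all-to-all contrastive alignment could look like, the sketch below averages a standard symmetric InfoNCE loss over every unordered pair of modalities, so no modality acts as a pivot. This is not the authors' implementation: the loss form, the temperature value, the modality names, and the functions `symmetric_info_nce` and `all_to_all_loss` are assumptions made for the example, and random vectors stand in for real encoder outputs.

```python
import itertools
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two (batch, dim) arrays.

    Row i of `a` and row i of `b` are assumed to describe the same location;
    every other row in the batch serves as a negative.
    """
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature
    idx = np.arange(len(a))
    loss_ab = -log_softmax(logits, axis=1)[idx, idx].mean()  # a retrieves b
    loss_ba = -log_softmax(logits, axis=0)[idx, idx].mean()  # b retrieves a
    return 0.5 * (loss_ab + loss_ba)


def all_to_all_loss(embeddings, temperature=0.07):
    """Average the pairwise loss over every unordered pair of modalities.

    `embeddings` maps a modality name to a (batch, dim) array. Returns the
    mean loss and the list of modality pairs that were compared.
    """
    pairs = list(itertools.combinations(sorted(embeddings), 2))
    losses = [
        symmetric_info_nce(embeddings[m], embeddings[n], temperature)
        for m, n in pairs
    ]
    return float(np.mean(losses)), pairs


# Hypothetical usage: random vectors in place of the five encoders' outputs.
rng = np.random.default_rng(0)
modalities = ["aerial", "street", "elevation", "text", "coords"]
batch = {name: rng.normal(size=(8, 16)) for name in modalities}
loss, pairs = all_to_all_loss(batch)
print(f"{len(pairs)} modality pairs, mean loss = {loss:.3f}")
```

With five modalities this compares 10 pairs, whereas a pivot-based scheme would compare only the 4 pairs that involve the pivot.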

Related Material


@InProceedings{Astruc_2026_CVPR,
    author    = {Astruc, Guillaume and Trulls, Eduard and Hosang, Jan and Landrieu, Loic and Sarlin, Paul-Edouard},
    title     = {UniGeoCLIP: Unified Geospatial Contrastive Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {7847-7856}
}