@InProceedings{Choudhury_2026_WACV,
    author    = {Choudhury, Shabnam and Salunkhe, Yash and Rajan, Vaibhav and Chaudhuri, Subhasis and Banerjee, Biplab},
    title     = {X-JEPA: A Novel Joint Learning Cross-Modal Predictive Alignment Framework for Remote Sensing Image Retrieval},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {March},
    year      = {2026},
    pages     = {4355-4364}
}
X-JEPA: A Novel Joint Learning Cross-Modal Predictive Alignment Framework for Remote Sensing Image Retrieval
Abstract
The growing scale and heterogeneity of remote sensing (RS) imagery demand robust, scalable frameworks for content-based image retrieval across sensor modalities. We introduce X-JEPA, a novel predictive self-supervised architecture explicitly designed for cross-modal remote sensing image retrieval (RS-CMIR), and the first to extend joint embedding predictive paradigms beyond unimodal domains. Unlike prior contrastive or reconstruction-based methods, X-JEPA formulates representation learning as a latent forecasting task: predicting the semantic embedding of a target modality given context from another. To enforce modality-invariant alignment, we propose a geometry-aware Prediction Space Alignment (PSA) loss, which captures the structure of the latent space without requiring pixel-level reconstruction or modality pairing. We evaluate X-JEPA on two large-scale benchmarks, BEN-14K (Sentinel-1/Sentinel-2) and fMoW (RGB/Sentinel), across both unimodal and cross-modal retrieval tasks. X-JEPA consistently outperforms state-of-the-art self-supervised baselines, including MAE, SatMAE, CrossMAE, CSMAE-SESD, CROMA, SkySense, DeCUR, and REJEPA, achieving up to 11.0% F1-score improvement in cross-modal retrieval and 9.8% in unimodal settings. The model also remains lightweight, requiring fewer parameters while yielding 8-10% F1-score gains on average, which establishes a new state of the art for scalable, sensor-agnostic RS-CMIR.
