RobSense: A Robust Multi-modal Foundation Model for Remote Sensing with Static, Temporal, and Incomplete Data Adaptability

BibTeX
@InProceedings{Do_2025_CVPR,
    author    = {Do, Minh Kha and Han, Kang and Lai, Phu and Phan, Khoa T. and Xiang, Wei},
    title     = {RobSense: A Robust Multi-modal Foundation Model for Remote Sensing with Static, Temporal, and Incomplete Data Adaptability},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {7427-7436}
}
Abstract
Foundation models for remote sensing have garnered increasing attention for their strong performance across various observation tasks. However, current models lack robustness in handling diverse input types and incomplete data in downstream tasks. In this paper, we propose RobSense, a robust multi-modal foundation model for Multi-spectral and Synthetic Aperture Radar data. RobSense is designed with modular components and pre-trained with a combination of temporal multi-modal alignment and masked autoencoder strategies on a large-scale dataset. It can therefore support diverse input types, from static to temporal and uni-modal to multi-modal. To further handle incomplete data, we incorporate two uni-modal latent reconstructors that recover rich representations from incomplete inputs, addressing variability in spectral bands and irregularities in temporal sequences. Extensive experiments demonstrate that RobSense consistently outperforms state-of-the-art baselines on complete datasets across four input types for segmentation, classification, and change detection. On incomplete datasets, RobSense outperforms the baselines by increasingly large margins as the missing rate grows. Project page: https://ikhado.github.io/robsense/
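To make the latent-reconstruction idea concrete, below is a minimal, hypothetical sketch of one way a uni-modal latent reconstructor could recover a full representation from an incomplete input (e.g., missing spectral bands or timesteps): missing tokens are replaced with a learned placeholder and the sequence is refined by a small Transformer. The abstract does not specify RobSense's architecture, so every module name, shape, and hyperparameter here is an illustrative assumption, not the authors' implementation.

    # Illustrative sketch only; not RobSense's actual architecture.
    import torch
    import torch.nn as nn

    class LatentReconstructor(nn.Module):
        """Fills latents of missing tokens with a learned placeholder,
        then refines the whole sequence with a small Transformer."""
        def __init__(self, dim: int = 256, depth: int = 2, heads: int = 4):
            super().__init__()
            # Learned embedding standing in for any missing band/timestep.
            self.missing_token = nn.Parameter(torch.zeros(1, 1, dim))
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, batch_first=True)
            self.refiner = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, latents: torch.Tensor, present: torch.Tensor):
            # latents: (B, N, D) per-token features from a modality encoder
            # present: (B, N) boolean mask; False marks missing tokens
            filled = torch.where(
                present.unsqueeze(-1), latents,
                self.missing_token.expand_as(latents))
            return self.refiner(filled)  # (B, N, D) recovered representation

    # Usage: drop two of eight band/timestep tokens and reconstruct.
    if __name__ == "__main__":
        B, N, D = 4, 8, 256
        latents = torch.randn(B, N, D)
        present = torch.ones(B, N, dtype=torch.bool)
        present[:, -2:] = False          # simulate incomplete input
        recon = LatentReconstructor(dim=D)(latents, present)
        print(recon.shape)               # torch.Size([4, 8, 256])

In this sketch the reconstructor operates in latent space rather than on raw pixels, which matches the abstract's description of recovering rich representations from incomplete inputs before they reach downstream task heads.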