S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data

Xuyang Li, Danfeng Hong, Jocelyn Chanussot; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 24088-24097

Abstract


In the expansive domain of computer vision, a myriad of pre-trained models are at our disposal. However, most of these models are designed for natural RGB images and prove inadequate for spectral remote sensing (RS) images. Spectral RS images have two main traits: (1) multiple bands capturing diverse feature information, and (2) spatial alignment and consistent spectral sequencing within the spatial-spectral dimension. In this paper, we introduce Spatial-SpectralMAE (S2MAE), a specialized pre-trained architecture for spectral RS imagery. S2MAE employs a 3D transformer for masked autoencoder modeling, integrating learnable spectral-spatial embeddings with a 90% masking ratio. The model efficiently captures local spectral consistency and spatial invariance using compact cube tokens, demonstrating versatility across diverse input characteristics. This adaptability facilitates progressive pretraining on extensive spectral datasets. The effectiveness of S2MAE is validated through continuous pretraining on two sizable datasets totaling over a million training images. The pre-trained model is subsequently applied to three distinct downstream tasks, with in-depth ablation studies conducted to emphasize its efficacy.
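To make the core mechanism concrete, the sketch below illustrates the two operations the abstract names: tokenizing a spectral cube into compact 3D cube tokens and randomly masking 90% of them. This is a minimal PyTorch approximation written for this page, not the authors' released code; the function names, the patch and band-group sizes, and the tensor layout are assumptions, and the transformer encoder/decoder and learnable spectral-spatial embeddings are omitted.

import torch

def cube_tokenize(x, patch=8, band_group=4):
    # x: (B, C, H, W) spectral image. Split the spatial plane into
    # patch x patch windows and the band axis into groups of band_group,
    # producing compact 3D "cube" tokens (C, H, W assumed divisible).
    B, C, H, W = x.shape
    cubes = (x.unfold(1, band_group, band_group)
              .unfold(2, patch, patch)
              .unfold(3, patch, patch))
    # cubes: (B, C//band_group, H//patch, W//patch, band_group, patch, patch)
    return cubes.reshape(B, -1, band_group * patch * patch)

def random_mask(tokens, mask_ratio=0.9):
    # Keep a random (1 - mask_ratio) fraction of tokens; with a 90%
    # masking ratio the encoder sees only 10% of the cube tokens.
    B, N, D = tokens.shape
    n_keep = max(1, int(N * (1 - mask_ratio)))
    ids_keep = torch.rand(B, N).argsort(dim=1)[:, :n_keep]
    kept = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return kept, ids_keep

# Example: a 16-band 64x64 image yields 256 cube tokens; 25 survive the mask.
x = torch.randn(2, 16, 64, 64)
tokens = cube_tokenize(x)          # (2, 256, 256)
kept, ids = random_mask(tokens)    # kept: (2, 25, 256)

Grouping a few adjacent bands into each cube is what lets such a sketch exploit local spectral consistency: neighboring bands are highly correlated, so a heavily masked token set still carries enough signal for reconstruction.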

Related Material


[bibtex]
@InProceedings{Li_2024_CVPR,
    author    = {Li, Xuyang and Hong, Danfeng and Chanussot, Jocelyn},
    title     = {S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {24088-24097}
}