CF-IPT: Cross-Modal Fusion Interactive Prompt Tuning of Vision-Language Pre-Trained Model for Multisource Remote Sensing Data Classification

Jinheng Ji, Jiahui Qu, Wenqian Dong, Yunsong Li; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 23021-23030

Abstract


Fine-tuning Vision-Language Models (VLMs) trained on large-scale datasets of natural image-text pairs has demonstrated impressive performance on various downstream tasks. However, fine-tuning them for remote sensing (RS) tasks faces two barriers: (1) a data-level barrier caused by the fundamental modality gap between natural imagery and RS data, and (2) a task-level barrier stemming from the need to model interactions among multiple sources. This paper proposes a Cross-modal Fusion Interactive Prompt Tuning (CF-IPT) method to fine-tune CLIP for multi-source RS image classification. It leverages the prompt learning framework to shift the alignment target of the text branch from natural images to multi-source RS images. Specifically, we design a Multi-source Interactive Fusion-guided Spectral-Spatial Prompt Generation (MFPG) module, which uses cross-modal feature interaction to generate a prompt matrix that preserves the original spectral and spatial information while performing adaptive multi-scale fusion, addressing the multi-source image adaptation problem. We then propose a Spectral-Spatial Prompt-guided Visual-Text Prompt Interaction (V-TPI) strategy, which uses the spectral-spatial prompt matrices to guide visual-textual prompt interaction and inject RS-specific information into both branches of CLIP, ultimately enabling multi-source RS image-text representation alignment. The proposed approach performs multi-source RS image classification with merely 0.76% of CLIP's parameters. Experiments on several widely used datasets demonstrate its effectiveness.
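
To give a sense of scale for the 0.76% figure, the short sketch below converts a percentage budget into an absolute parameter count. It assumes the figure refers to parameters used for tuning relative to the backbone's total, which the abstract does not spell out. The function name `trainable_budget` and the backbone size of 150,000,000 are hypothetical values chosen for illustration only; the abstract does not state which CLIP variant is used or its size.

```python
def trainable_budget(total_params: int, basis_points: int = 76) -> int:
    """Return the parameter count for a fraction given in basis points.

    76 basis points equals 0.76%. Integer arithmetic avoids floating-point
    rounding; the result is rounded down to a whole parameter.
    """
    if total_params < 0 or basis_points < 0:
        raise ValueError("total_params and basis_points must be non-negative")
    return total_params * basis_points // 10_000


# Hypothetical backbone size, NOT taken from the paper.
HYPOTHETICAL_BACKBONE_PARAMS = 150_000_000

budget = trainable_budget(HYPOTHETICAL_BACKBONE_PARAMS)
print(f"{budget:,} parameters under a 0.76% budget")
```

Under that assumed backbone size, the budget is on the order of one million parameters, which is the sense in which prompt tuning is lightweight compared with updating the full model.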

Related Material


[bibtex]
@InProceedings{Ji_2026_CVPR,
    author    = {Ji, Jinheng and Qu, Jiahui and Dong, Wenqian and Li, Yunsong},
    title     = {CF-IPT: Cross-Modal Fusion Interactive Prompt Tuning of Vision-Language Pre-Trained Model for Multisource Remote Sensing Data Classification},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {23021-23030}
}