Bridging the Modality Gap: Training-free Adaptation of Vision-Language Models for Remote Sensing via Visual Prototypes
Abstract
Large-scale Vision-Language Models (VLMs) have demonstrated remarkable few-shot learning capabilities across various visual tasks. However, effectively adapting these models to remote sensing, a domain characterized by specialized object appearances and scarce labeled data, remains non-trivial. In this work, we present a training-free adaptation strategy that employs region-level visual prototypes for object detection in remote sensing imagery. Instead of relying on textual prompts, we directly derive representative embeddings from a small number of annotated bounding boxes, capturing domain-specific characteristics that generic language encoders may overlook. To compensate for the resulting modality gap between region-region and region-text similarities, we introduce an affine normalization step that re-calibrates prototype-based scores without any model fine-tuning. We evaluate our method on the DIOR and NWPU-VHR10 benchmarks, demonstrating consistent and substantial improvements over previous training-free approaches. Moreover, we offer an in-depth analysis of different prototype construction and aggregation strategies, revealing how carefully chosen protocols can further strengthen few-shot detection in remote sensing.
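For intuition, the following is a minimal NumPy sketch of the two ingredients summarized above: building class prototypes from a handful of annotated boxes, and re-calibrating the resulting prototype-based scores with an affine correction so they are comparable to region-text scores. The function names, the mean-pooled prototype construction, and the mean/std matching used for the affine step are illustrative assumptions, not the paper's exact protocol.

```python
import numpy as np

def build_prototypes(region_embeddings, labels, num_classes):
    """Build one visual prototype per class from few-shot box annotations.

    region_embeddings: (N, D) features of annotated boxes from the VLM image encoder.
    labels:            (N,)   class index of each annotated box.
    Assumes at least one annotated box per class.
    """
    protos = np.zeros((num_classes, region_embeddings.shape[1]))
    for c in range(num_classes):
        feats = region_embeddings[labels == c]
        # L2-normalize each region embedding, then average them.
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        protos[c] = feats.mean(axis=0)
    # Re-normalize so cosine similarity with query regions is well defined.
    return protos / np.linalg.norm(protos, axis=1, keepdims=True)

def affine_calibration(proto_scores, text_scores):
    """Hypothetical affine re-calibration (no fine-tuning): rescale and shift
    the prototype-based (region-region) scores so their mean and spread match
    those of the text-based (region-text) scores."""
    a = text_scores.std() / (proto_scores.std() + 1e-8)
    b = text_scores.mean() - a * proto_scores.mean()
    return a * proto_scores + b

# Usage sketch:
#   queries    -- (M, D) L2-normalized embeddings of candidate regions in a test image
#   prototypes -- (C, D) output of build_prototypes
#   scores     = queries @ prototypes.T            # (M, C) cosine similarities
#   calibrated = affine_calibration(scores, text_baseline_scores)
```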
Related Material

[pdf]

[bibtex]
@InProceedings{Barbier_2025_CVPR,
    author    = {Barbier, Cl\'ement and Abeloss, Baptiste and Herbin, St\'ephane},
    title     = {Bridging the Modality Gap: Training-free Adaptation of Vision-Language Models for Remote Sensing via Visual Prototypes},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {3057-3066}
}