PAN-RSVQA: Vision Foundation Models as Pseudo-ANnotators for Remote Sensing Visual Question Answering
Abstract
While the quantity of Earth observation (EO) images is constantly increasing, the benefits that can be derived from them remain limited by the technical expertise required to run information extraction pipelines. Using natural language to break this barrier, Remote Sensing Visual Question Answering (RSVQA) aims to make EO images usable by a wider, general public. Traditional RSVQA methods use a visual encoder to extract generic features from images, which are then fused with the features of the questions entered by users. Given their multi-task nature, vision foundation models (VFMs) make it possible to go beyond such generic visual features: they can be seen as pseudo-annotators that extract diverse sets of features from a collection of inter-related tasks (detected objects, segmentation maps, scene descriptions, etc.). In this work, we propose PAN-RSVQA, a new method that combines a VFM and its pseudo-annotations with RSVQA by leveraging a transformer-based multi-modal encoder. These pseudo-annotations bring diverse, naturally interpretable visual cues, as they are aligned with how humans reason about images: therefore, PAN-RSVQA not only exploits the large-scale training of VFMs but also enables accurate and interpretable RSVQA. Experiments on two datasets show results on par with the state of the art while enabling enhanced interpretation of the model predictions, which we analyze through visual perturbations of samples and ablations of each pseudo-annotator's role. In addition, PAN-RSVQA is modular and easily extensible to new pseudo-annotators from other VFMs.
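To make the fusion idea concrete, below is a minimal sketch (not the authors' code) of how pseudo-annotation features and question features could be fused by a transformer-based multi-modal encoder and mapped to an answer class. All module names, feature dimensions, and the mean-pooling choice are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of the PAN-RSVQA fusion idea: pseudo-annotations
# from a VFM become tokens, concatenated with question tokens, fused by
# a transformer encoder, then classified into a fixed answer vocabulary.
import torch
import torch.nn as nn

class PanRSVQASketch(nn.Module):
    def __init__(self, d_model=256, n_answers=100, n_layers=4, n_heads=8):
        super().__init__()
        # Project each pseudo-annotation feature (e.g. detected objects,
        # segmentation statistics, caption embeddings) into a shared space.
        self.annot_proj = nn.Linear(768, d_model)     # 768: assumed VFM feature size
        self.question_proj = nn.Linear(768, d_model)  # 768: assumed text-encoder size
        # Learned type embeddings distinguish annotation from question tokens.
        self.type_embed = nn.Embedding(2, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.fusion = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Linear(d_model, n_answers)  # answers as classes

    def forward(self, annot_feats, question_feats):
        # annot_feats:    (B, N_a, 768), one token per pseudo-annotation
        # question_feats: (B, N_q, 768), tokens from a text encoder
        a = self.annot_proj(annot_feats) + self.type_embed.weight[0]
        q = self.question_proj(question_feats) + self.type_embed.weight[1]
        fused = self.fusion(torch.cat([a, q], dim=1))
        # Pool and classify; mean pooling is an assumption, not the paper's choice.
        return self.classifier(fused.mean(dim=1))

# Usage with dummy tensors:
model = PanRSVQASketch()
logits = model(torch.randn(2, 12, 768), torch.randn(2, 20, 768))
print(logits.shape)  # torch.Size([2, 100])
```

Because each pseudo-annotator contributes its own tokens, this layout also makes the ablations described above straightforward: dropping one annotator's tokens from the concatenation isolates its contribution to the prediction.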
Related Material
[pdf]
[bibtex]
@InProceedings{Chappuis_2025_CVPR,
  author    = {Chappuis, Christel and S\"umb\"ul, Gencer and Montariol, Syrielle and Lobry, Sylvain and Tuia, Devis},
  title     = {PAN-RSVQA: Vision Foundation Models as Pseudo-ANnotators for Remote Sensing Visual Question Answering},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {3005-3016}
}