MMGeo: Multimodal Compositional Geo-Localization for UAVs

Yuxiang Ji, Boyong He, Zhuoyue Tan, Liaoni Wu; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 25165-25175

Abstract


Multimodal geo-localization methods can inherently overcome the limitations of unimodal sensor systems by leveraging complementary information from different modalities. However, existing retrieval-based methods rely on a comprehensive multimodal database, which is often difficult to obtain in practice. In this paper, we introduce a more practical problem: localizing drone-view images by combining multimodal query data against a satellite-view reference map, which integrates multimodal information while avoiding the need for an extensive multimodal database. We present MMGeo, which learns to push the composition of multimodal representations toward the target reference map through a unified framework. By utilizing a comprehensive multimodal query (image plus point cloud, depth, or text), we can achieve more robust and accurate geo-localization, especially in unknown and complex environments. Additionally, we extend two visual geo-localization datasets, GTA-UAV and UAV-VisLoc, to multiple modalities, establishing the first UAV geo-localization datasets that combine image, point cloud, depth, and text data. Experiments demonstrate the effectiveness of MMGeo for UAV multimodal compositional geo-localization, as well as its generalization to real-world scenarios. The code and dataset are available at https://github.com/Yux1angJi/MMGeo.
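To make the compositional retrieval setting concrete, the sketch below illustrates the general idea of composing a drone-image embedding with an auxiliary-modality embedding (depth, point cloud, or text) and retrieving the nearest satellite tile by cosine similarity. This is only a toy illustration under stated assumptions: the abstract does not describe MMGeo's architecture, so the fusion (a simple average) and the random "embeddings" here are placeholders, not the paper's method.

```python
import numpy as np

# Toy sketch of compositional multimodal retrieval. All components here
# (average fusion, random stand-in embeddings) are illustrative assumptions,
# not MMGeo's actual learned encoders or fusion module.

rng = np.random.default_rng(0)
DIM = 32  # embedding dimension for the toy example

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors to unit length so dot product equals cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def compose(image_emb: np.ndarray, aux_emb: np.ndarray) -> np.ndarray:
    """Fuse drone-image and auxiliary (depth/point-cloud/text) embeddings.
    A plain average stands in for a learned composition module."""
    return l2_normalize(image_emb + aux_emb)

def retrieve(query: np.ndarray, reference_map: np.ndarray) -> int:
    """Index of the satellite-tile embedding most similar to the query."""
    return int(np.argmax(reference_map @ query))

# Reference map: 100 satellite-tile embeddings (random stand-ins).
reference_map = l2_normalize(rng.normal(size=(100, DIM)))

# Simulate a query over tile 42: two noisy "modality" views of its embedding.
target = 42
image_emb = l2_normalize(reference_map[target] + 0.05 * rng.normal(size=DIM))
aux_emb = l2_normalize(reference_map[target] + 0.05 * rng.normal(size=DIM))

pred = retrieve(compose(image_emb, aux_emb), reference_map)
print(pred)
```

In a trained system, the composed query and the reference-map embeddings would be aligned by a contrastive-style objective rather than constructed by perturbation as in this toy setup.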

Related Material


[bibtex]
@InProceedings{Ji_2025_ICCV,
    author    = {Ji, Yuxiang and He, Boyong and Tan, Zhuoyue and Wu, Liaoni},
    title     = {MMGeo: Multimodal Compositional Geo-Localization for UAVs},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {25165-25175}
}