Efficient Visual Place Recognition Through Multimodal Semantic Knowledge Integration

Sitao Zhang, Hongda Mao, Qingshuang Chen, Yelin Kim; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 5601-5610

Abstract


Visual place recognition (VPR) is crucial for autonomous navigation and robotic mapping. Current methods struggle with perceptual aliasing and computational inefficiency. We present SemVPR, a novel approach integrating multimodal semantic knowledge into VPR. By leveraging a pre-trained vision-language model as a teacher during the training phase, SemVPR learns local visual and semantic descriptors simultaneously, effectively mitigating perceptual aliasing through semantic-aware aggregation without extra inference cost. The proposed nested descriptor learning strategy generates a series of ultra-compact global descriptors, reduced by approximately compared to state-of-the-art methods, in a coarse-to-fine manner, eliminating the need for offline dimensionality reduction or for training multiple models. Extensive experiments across various VPR benchmarks demonstrate that SemVPR consistently outperforms state-of-the-art methods at significantly lower computational cost, making it feasible for latency-sensitive scenarios in real-world applications.
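The abstract describes "nested" descriptor learning that yields a coarse-to-fine family of global descriptors from a single model. As a rough illustration of what such nested descriptors enable at inference time (in the spirit of Matryoshka-style embeddings, where each prefix of the vector is itself a usable descriptor), the sketch below shows coarse shortlisting followed by fine re-ranking. All function names, dimensions, and the shortlist size are illustrative assumptions, not details from the paper:

```python
import numpy as np

def nested_descriptors(global_desc, dims=(64, 128, 256, 512)):
    """Truncate one global descriptor into a coarse-to-fine family.

    Each prefix is L2-normalized so it can be used directly for
    cosine-similarity retrieval at its own dimensionality.
    (Illustrative sketch; dims are assumed, not from the paper.)
    """
    family = {}
    for d in dims:
        prefix = global_desc[:d]
        family[d] = prefix / np.linalg.norm(prefix)
    return family

# Coarse-to-fine retrieval: shortlist with the cheap 64-D prefix,
# then re-rank the shortlist with the full 512-D descriptor.
rng = np.random.default_rng(0)
db = rng.standard_normal((1000, 512)).astype(np.float32)   # mock database
query = rng.standard_normal(512).astype(np.float32)        # mock query

q = nested_descriptors(query)
db64 = db[:, :64] / np.linalg.norm(db[:, :64], axis=1, keepdims=True)
shortlist = np.argsort(db64 @ q[64])[-20:]                 # cheap coarse pass
db512 = db[shortlist] / np.linalg.norm(db[shortlist], axis=1, keepdims=True)
best = shortlist[np.argmax(db512 @ q[512])]                # fine re-ranking
```

The appeal of this scheme is that one trained model serves every latency budget: a deployment can pick a prefix length per stage instead of training or storing separate compact models.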

Related Material


[bibtex]
@InProceedings{Zhang_2025_ICCV,
    author    = {Zhang, Sitao and Mao, Hongda and Chen, Qingshuang and Kim, Yelin},
    title     = {Efficient Visual Place Recognition Through Multimodal Semantic Knowledge Integration},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {5601-5610}
}