@InProceedings{Song_2026_CVPR,
    author    = {Song, Yukun and Wang, Changwei and Pei, Xingtian and Xu, Shibiao and Xu, Wenhao and Chen, Shunpeng and Zhang, Yu and Zhang, Ke and Xu, Rongtao and Feng, Xuxiang and Wang, Pengyang},
    title     = {DialogueVPR: Towards Conversational Visual Place Recognition},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {41100-41110}
}
DialogueVPR: Towards Conversational Visual Place Recognition
Abstract
Inspired by how humans communicate spatial information, language-guided geo-localization has gained significant traction for its intuitive and practical value. Despite this progress, most methods still rely on a static, one-shot retrieval paradigm, which fails to handle the ambiguity and incompleteness inherent in real-world natural language descriptions. We propose a paradigm shift to reasoning retrieval and introduce Dialogue Place Recognition (DlgPR), which casts localization as an interactive, dialogue-driven reasoning process. To support this new task, we present DlgQuest-Cities, the first large-scale dialogue-based benchmark for place recognition, and a unified reasoning framework that couples a cross-modal multi-level retriever with an intelligent questioner, DQ-pilot. DQ-pilot is trained in a curriculum: supervised fine-tuning on a curated DQ-cities-20k subset followed by reinforcement refinement on a harder DQ-cities-10k split via GRPO. Two task-aligned metrics guide learning: a Discriminative Difficulty Index (DDI) for curriculum sampling and a Positional Retrieval Gain (PRG) reward that directly measures retrieval improvement induced by a question. Experiments show this reasoning-based approach significantly outperforms baselines. The code will be made publicly available upon acceptance.