ETR: An Efficient Transformer for Re-Ranking in Visual Place Recognition
Visual place recognition aims to estimate the geographical location of a given image, and is usually addressed by retrieving similar reference images from a database. The reference images are typically retrieved via similarity search using global descriptors, and local descriptors are then used to re-rank the initially retrieved candidates. Re-ranking with local descriptors can significantly improve the accuracy of global retrieval, but it comes at a high computational cost. To achieve a good trade-off between accuracy and efficiency, we propose an Efficient Transformer for Re-ranking (ETR), which uses both global and local descriptors to re-rank the top candidates in a single shot. In contrast to traditional re-ranking methods, we leverage self-attention to capture relationships between local descriptors within a single image and cross-attention to explore the similarity between the two images of a pair. We show that the proposed model can be regarded as a general re-ranking algorithm that significantly boosts the performance of other global-only retrieval methods. Extensive experimental results show that our method outperforms state-of-the-art methods and is orders of magnitude faster.
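The two-stage pipeline outlined in the abstract (global retrieval followed by attention-based re-ranking of the top candidates) can be illustrated with a minimal NumPy sketch. This is not the ETR architecture: the attention here has no learned projections, heads, or training, and every name (`pair_score`, `retrieve_and_rerank`, `top_k`, the descriptor shapes) is a hypothetical choice made only for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along the given axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    """Scaled dot-product attention with no learned weights (assumption)."""
    scale = np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T / scale, axis=-1)
    return weights @ values

def pair_score(query_local, cand_local):
    """Toy similarity for one image pair; a stand-in for a trained re-ranker."""
    # Self-attention: each local descriptor attends to the others in its own image.
    q = l2_normalize(query_local + attention(query_local, query_local, query_local))
    c = l2_normalize(cand_local + attention(cand_local, cand_local, cand_local))
    # Cross-attention: query descriptors attend to the candidate's descriptors.
    q_cross = l2_normalize(attention(q, c, c))
    # Mean cosine agreement between each query descriptor and its attended summary.
    return float(np.mean(np.sum(q * q_cross, axis=-1)))

def retrieve_and_rerank(query_global, query_local, db_global, db_local, top_k=5):
    """Return database indices of the top_k candidates, re-ranked, with scores."""
    # Stage 1: global retrieval by cosine similarity over the whole database.
    sims = l2_normalize(db_global) @ l2_normalize(query_global)
    candidates = np.argsort(-sims)[:top_k]
    # Stage 2: score only the top_k candidates using local descriptors.
    scores = np.array([pair_score(query_local, db_local[i]) for i in candidates])
    order = np.argsort(-scores)
    return candidates[order], scores[order]

# Synthetic data: 20 database images, 16-d global and 8 x 32-d local descriptors.
rng = np.random.default_rng(0)
db_global = rng.normal(size=(20, 16))
db_local = rng.normal(size=(20, 8, 32))
# The query is a lightly perturbed copy of database image 3.
query_global = db_global[3] + 0.01 * rng.normal(size=16)
query_local = db_local[3] + 0.01 * rng.normal(size=(8, 32))

ranked, scores = retrieve_and_rerank(query_global, query_local, db_global, db_local)
print(ranked, scores)
```

The cost structure is the point of the sketch: the local-descriptor scoring runs only on the `top_k` candidates returned by the cheap global search, so the expensive stage does not grow with the database size.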