@InProceedings{Shi_2022_ACCV,
    author    = {Shi, Yujiao and Yu, Xin and Wang, Shan and Li, Hongdong},
    title     = {CVLNet: Cross-View Feature Correspondence Learning for Video-based Camera Localization},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2022},
    pages     = {652-669}
}
CVLNet: Cross-View Feature Correspondence Learning for Video-based Camera Localization
Abstract
This paper tackles the problem of Cross-view Video-based camera Localization (CVL). The task is to localize a query camera by leveraging information from its past observations, i.e., a continuous sequence of images observed at previous time stamps, and matching them to a large overhead-view satellite image.
The critical challenge of this task is to learn a powerful global feature descriptor for the sequential ground-view images while considering its domain alignment with reference satellite images. For this purpose, we introduce CVLNet, which first projects the sequential ground-view images into an overhead view by exploring the ground-and-overhead geometric correspondences and then leverages the photo consistency among the projected images to form a global representation.
In this way, the cross-view domain differences are bridged.
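The ground-to-overhead projection can be illustrated with a simple flat-ground resampling. The sketch below is not the CVLNet implementation: it assumes a pinhole camera with a horizontal optical axis, a perfectly planar ground, and nearest-neighbour sampling, and the function name `ground_to_overhead` and all of its parameters are hypothetical.

```python
def ground_to_overhead(image, fx, fy, cx, cy, cam_height, grid_size, metres_per_cell):
    """Resample a forward-facing image onto a top-down grid centred on the camera.

    Assumptions (illustrative only): camera axes are x right, y down, z forward;
    the ground is the plane y = cam_height; image is a list of rows of pixel values.
    Cells that do not project into the image are left as None.
    """
    height, width = len(image), len(image[0])
    half = grid_size / 2.0
    overhead = [[None] * grid_size for _ in range(grid_size)]
    for row in range(grid_size):
        for col in range(grid_size):
            # Cell centre in metres; row 0 is the farthest row ahead of the camera.
            x = (col + 0.5 - half) * metres_per_cell
            z = (half - (row + 0.5)) * metres_per_cell
            if z <= 0:
                continue  # ground points behind the camera are not visible
            # Pinhole projection of the ground point (x, cam_height, z).
            u = fx * x / z + cx
            v = fy * cam_height / z + cy
            ui, vi = int(round(u)), int(round(v))
            if 0 <= ui < width and 0 <= vi < height:
                overhead[row][col] = image[vi][ui]
    return overhead
```

Under these assumptions, each frame of the query sequence would be projected in this way, and the resulting overhead grids could then be compared with one another and with the satellite image in a common top-down frame.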
Since the reference satellite images are usually pre-cropped and regularly sampled, there is always a misalignment between the query camera location and its matching satellite image center. Motivated by this, we propose estimating the query camera's relative displacement to a satellite image before similarity matching.
In this displacement estimation process, we also consider the uncertainty of the camera location. For example, a camera is unlikely to be on top of trees.
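One way to picture displacement estimation with location uncertainty is an exhaustive offset search weighted by a plausibility map. The sketch below is an assumption-laden illustration rather than the method of the paper: it swaps in a plain sum-of-squared-differences score, and the function `estimate_displacement` and the `location_prior` grid (0 for implausible positions such as tree canopy, 1 for plausible ones) are hypothetical.

```python
def estimate_displacement(query, satellite, location_prior):
    """Return ((dy, dx), score) for the best placement of query inside satellite.

    query: h x w grid of values from the projected ground views.
    satellite: H x W grid with H >= h and W >= w.
    location_prior: (H - h + 1) x (W - w + 1) grid of weights in [0, 1],
        one per candidate offset (hypothetical; 0 marks implausible positions).
    (dy, dx) is measured relative to the centred placement.
    """
    h, w = len(query), len(query[0])
    n_rows = len(satellite) - h + 1
    n_cols = len(satellite[0]) - w + 1
    best_score, best_offset = -1.0, (0, 0)
    for r in range(n_rows):
        for c in range(n_cols):
            # Sum of squared differences between the query and this window.
            ssd = sum(
                (query[i][j] - satellite[r + i][c + j]) ** 2
                for i in range(h)
                for j in range(w)
            )
            # Higher is better; the prior suppresses implausible positions.
            score = location_prior[r][c] / (1.0 + ssd)
            if score > best_score:
                best_score, best_offset = score, (r, c)
    centre = ((n_rows - 1) // 2, (n_cols - 1) // 2)
    return (best_offset[0] - centre[0], best_offset[1] - centre[1]), best_score
```

In this toy form, the returned score could serve as the similarity used for retrieval once the displacement has been accounted for, and setting a prior weight to zero rules out a position even when its appearance matches perfectly.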
To evaluate the performance of the proposed method, we collect satellite images from Google Maps for the KITTI dataset and construct a new cross-view video-based localization benchmark dataset, KITTI-CVL. Extensive experiments demonstrate the effectiveness of video-based localization over single image-based localization and the superiority of each proposed module over other alternatives.