Learning Multi-Scene Absolute Pose Regression With Transformers

Yoli Shavit, Ron Ferens, Yosi Keller; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 2733-2742


Absolute camera pose regression methods estimate the position and orientation of a camera using only the captured image. A convolutional backbone with a multi-layer perceptron head is trained with images and pose labels to embed a single reference scene at a time. Recently, this framework was extended for learning multiple scenes with a single model by adding a multi-layer perceptron head per scene. In this work, we propose to learn multi-scene absolute camera pose regression with transformers, where encoders are used to aggregate activation maps with self-attention and decoders transform latent features into candidate pose predictions in parallel, each associated with a different scene. This formulation allows our model to focus on general features that are informative for localization while embedding multiple scenes at once. We evaluate our method on commonly benchmarked indoor and outdoor datasets and show that it surpasses both multi-scene and single-scene absolute pose regressors.
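
The data flow described in the abstract can be illustrated with a small, untrained sketch in plain Python. It is not the authors' implementation: it assumes a single attention head with no learned projections, a single decoder, and random weights. The step that scores the per-scene candidates and keeps the most probable one is an assumption added here to make the example complete. All names (`attend`, `predict_pose`, `w_scene`, `w_pos`, `w_rot`) are hypothetical.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(queries, tokens):
    """Scaled dot-product attention: each query becomes a weighted mean of tokens."""
    scale = math.sqrt(len(tokens[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, t) / scale for t in tokens])
        out.append([sum(w * t[d] for w, t in zip(weights, tokens))
                    for d in range(len(q))])
    return out

def linear(weight, x):
    """Apply a bias-free linear map given as a list of rows."""
    return [dot(row, x) for row in weight]

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def predict_pose(tokens, scene_queries, w_scene, w_pos, w_rot):
    # Encoder stand-in: self-attention lets every activation-map token
    # aggregate information from all the others.
    encoded = attend(tokens, tokens)
    # Decoder stand-in: one query per scene cross-attends to the encoded
    # tokens, giving one latent candidate per scene in parallel.
    latents = attend(scene_queries, encoded)
    # Assumed selection step: score each candidate and keep the likeliest scene.
    scene_probs = softmax([dot(w_scene, z) for z in latents])
    scene = max(range(len(latents)), key=scene_probs.__getitem__)
    position = linear(w_pos, latents[scene])   # 3-D translation
    quat = linear(w_rot, latents[scene])       # unnormalized quaternion
    norm = math.sqrt(dot(quat, quat)) or 1.0   # guard against a zero vector
    orientation = [q / norm for q in quat]
    return scene, scene_probs, position, orientation

# Toy inputs: 16 flattened activation-map tokens of width 8, 4 scenes.
rng = random.Random(0)
dim, num_tokens, num_scenes = 8, 16, 4
tokens = random_matrix(rng, num_tokens, dim)
scene_queries = random_matrix(rng, num_scenes, dim)
scene, probs, position, orientation = predict_pose(
    tokens, scene_queries,
    w_scene=random_matrix(rng, 1, dim)[0],
    w_pos=random_matrix(rng, 3, dim),
    w_rot=random_matrix(rng, 4, dim),
)
print(scene, position, orientation)
```

With random weights the printed pose is meaningless. The sketch only shows how one image yields a candidate per scene and how a single position and unit-quaternion orientation could be read out from the chosen candidate.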

Related Material

@InProceedings{Shavit_2021_ICCV,
    author    = {Shavit, Yoli and Ferens, Ron and Keller, Yosi},
    title     = {Learning Multi-Scene Absolute Pose Regression With Transformers},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {2733-2742}
}