VGGSfM: Visual Geometry Grounded Deep Structure From Motion

Jianyuan Wang, Nikita Karaev, Christian Rupprecht, David Novotny; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 21686-21697

Abstract


Structure-from-motion (SfM) is a long-standing problem in the computer vision community, which aims to reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical frameworks solve this problem in an incremental manner by detecting and matching keypoints, registering images, triangulating 3D points, and conducting bundle adjustment. Recent research efforts have predominantly revolved around harnessing the power of deep learning techniques to enhance specific elements (e.g., keypoint matching), but are still based on the original non-differentiable pipeline. Instead, we propose a new deep SfM pipeline where each component is fully differentiable and thus can be trained in an end-to-end manner. To this end, we introduce new mechanisms and simplifications. First, we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks, which eliminates the need for chaining pairwise matches. Furthermore, we recover all cameras simultaneously based on the image and track features instead of gradually registering cameras. Finally, we optimise the cameras and triangulate 3D points via a differentiable bundle adjustment layer. We attain state-of-the-art performance on three popular datasets: CO3D, IMC Phototourism, and ETH3D.
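The bundle-adjustment objective mentioned in the abstract — minimising the reprojection error of 3D points against their 2D tracks — can be sketched in miniature. The toy below refines a single triangulated point with Gauss-Newton steps; the unit-focal, identity-rotation cameras, the `triangulate` helper, and all numeric values are illustrative assumptions, not the paper's implementation (which optimises cameras and points jointly through a differentiable layer).

```python
import numpy as np

# Hypothetical toy setup: two pinhole cameras with unit focal length,
# identity rotation, and a horizontal baseline (assumed for illustration).
cams = [np.array([0.0, 0.0, 0.0]), np.array([-1.0, 0.0, 0.0])]

def project(X, t):
    """Project 3D point X with a unit-focal pinhole camera at translation t."""
    p = X + t
    return p[:2] / p[2]

def triangulate(obs, cams, X0, iters=20):
    """Refine a 3D point by Gauss-Newton on the squared reprojection error."""
    X = X0.astype(float).copy()
    for _ in range(iters):
        J = np.zeros((2 * len(cams), 3))   # stacked Jacobian of residuals
        r = np.zeros(2 * len(cams))        # stacked reprojection residuals
        for i, (t, uv) in enumerate(zip(cams, obs)):
            p = X + t
            r[2 * i:2 * i + 2] = p[:2] / p[2] - uv
            J[2 * i, 0] = 1.0 / p[2]                  # d(u)/dx
            J[2 * i + 1, 1] = 1.0 / p[2]              # d(v)/dy
            J[2 * i:2 * i + 2, 2] = -p[:2] / p[2] ** 2  # d(u,v)/dz
        # Gauss-Newton update via the normal equations
        X += np.linalg.solve(J.T @ J, -J.T @ r)
    return X

X_true = np.array([0.5, -0.2, 4.0])
obs = [project(X_true, t) for t in cams]          # the point's 2D track
X_hat = triangulate(obs, cams, np.array([0.0, 0.0, 2.0]))
```

In a differentiable BA layer the same least-squares solve is unrolled (or implicitly differentiated) so that gradients flow back to the networks that produced the tracks and initial cameras.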

Related Material


[bibtex]
@InProceedings{Wang_2024_CVPR,
    author    = {Wang, Jianyuan and Karaev, Nikita and Rupprecht, Christian and Novotny, David},
    title     = {VGGSfM: Visual Geometry Grounded Deep Structure From Motion},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {21686-21697}
}