ORBIT: Benchmarking SfM in the Wild with 360° Video

CVPR 2026 Submission

Anonymous1
Anonymous

The ORBIT structure from motion dataset. ORBIT includes a variety of clips that are challenging for SfM methods. Scenes with many dynamic objects (a) or large textureless areas (b) are challenging for traditional approaches like COLMAP.

Abstract

Structure-from-Motion (SfM) is a cornerstone of 3D perception, yet current methods often fail when applied to complex videos involving challenging camera motions or dynamic scenes. Compounding the problem, the field lacks reliable ground-truth benchmarks for such difficult scenarios, making it hard to gauge real-world progress or pinpoint where improvements are most needed. To address this gap, we introduce a new benchmark for evaluating camera pose estimation. Our key insight is to leverage online panoramic 360° videos as a source of data from which to construct challenging clips, while still enabling robust ground-truth trajectory recovery. The panoramic nature of these videos provides richer visual context for tracking camera motion, even when parts of the view are affected by blur, motion, or dynamic objects. After tracking camera motion across the full 360° videos, we crop and reproject selected portions to generate perspective-view clips that serve as our benchmark, ORBIT, a diverse collection of 100 video clips. Experiments show that COLMAP and other state-of-the-art SfM methods struggle to accurately estimate camera positions on our benchmark, indicating that it remains a challenging and open problem space for future research. As a result, ORBIT provides a valuable testbed where researchers can meaningfully compete and measure progress on truly challenging, real-world SfM problems.

Sample 360° videos and benchmark clips

For each example, the top row shows the 360° frame from the original online video, the middle row shows the four cube faces used for cross validation, and the bottom row shows the test clip included in our benchmark.
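The relationship between a 360° frame and a perspective-view clip can be illustrated with a minimal sketch. It assumes the panorama is stored in equirectangular format and that the virtual camera is described by a horizontal field of view, a yaw, and a pitch; the function name, the axis conventions (x right, y down, z forward), and all parameter values are illustrative assumptions, not the pipeline used to build ORBIT.

```python
import math

def perspective_to_equirect(u, v, width, height, hfov_deg,
                            yaw_deg, pitch_deg, pano_w, pano_h):
    """Map a pixel (u, v) of a virtual pinhole view to equirectangular coordinates.

    Assumed conventions (hypothetical): x right, y down, z forward;
    positive yaw turns right, positive pitch looks up.
    """
    # Focal length in pixels from the horizontal field of view.
    f = (width / 2.0) / math.tan(math.radians(hfov_deg) / 2.0)
    # Ray through the pixel in the virtual camera frame.
    x, y, z = (u - width / 2.0) / f, (v - height / 2.0) / f, 1.0
    # Rotate by pitch about the x-axis.
    p = math.radians(pitch_deg)
    y, z = y * math.cos(p) - z * math.sin(p), y * math.sin(p) + z * math.cos(p)
    # Rotate by yaw about the y-axis.
    q = math.radians(yaw_deg)
    x, z = x * math.cos(q) + z * math.sin(q), -x * math.sin(q) + z * math.cos(q)
    # Convert the ray to longitude and latitude, then to panorama pixels.
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)
    lat = math.asin(y / norm)
    return (lon / (2.0 * math.pi) + 0.5) * pano_w, (lat / math.pi + 0.5) * pano_h

# The center pixel of a forward-facing view lands at the panorama center.
center = perspective_to_equirect(320, 240, 640, 480, 90.0, 0.0, 0.0, 2000, 1000)
```

Sampling the panorama at these coordinates for every output pixel (with interpolation) yields one perspective frame; repeating this per video frame with a chosen yaw and pitch gives a perspective clip.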

Frame ATE Comparisons

To show how different SfM methods (mainly MegaSaM, COLMAP, and VGGT-Long) perform on our benchmark, we showcase sample clips from ORBIT. For each clip, we use the median ATE over all methods as a success threshold to differentiate the methods' performance. The small box in the top-left corner corresponding to each method is green if that method's ATE for the current frame is below the median ATE. The median ATE is shown in the fourth box.
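The thresholding rule described above can be sketched as follows. The function name and the numbers are hypothetical and serve only as an illustration; the sketch assumes one clip-level ATE per method (used for the median) and a list of per-frame errors per method (compared against that median).

```python
from statistics import median

def frame_success(clip_ate, frame_errors):
    """Return the median-ATE threshold and per-frame success flags.

    clip_ate:     {method: ATE over the whole clip}
    frame_errors: {method: [per-frame ATE, ...]}
    A frame counts as a success (green box) when its error is strictly
    below the median of the clip-level ATEs over all methods.
    """
    threshold = median(clip_ate.values())
    flags = {m: [e < threshold for e in errs] for m, errs in frame_errors.items()}
    return threshold, flags

# Hypothetical values, not results from the benchmark.
clip_ate = {"MegaSaM": 0.1, "COLMAP": 0.5, "VGGT-Long": 0.2}
frame_errors = {
    "MegaSaM": [0.05, 0.3],
    "COLMAP": [0.4, 0.6],
    "VGGT-Long": [0.1, 0.2],
}
threshold, flags = frame_success(clip_ate, frame_errors)
```

Using the median makes the threshold relative to each clip, so a green box indicates a method doing better than the typical method on that clip rather than meeting a fixed accuracy.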

Estimated Camera Trajectories and Sample Clip Frames of ORBIT

Six sample clips are shown. Each pairs the input video with a comparison of the ground-truth (GT) and estimated camera trajectories.

Challenge Categorization

Each clip in ORBIT manifests a range of challenges. We provide sub-category evaluations for five of these challenges: Low Texture, Low Light, Presence of Crowd (objects moving independently of the camera), Presence of Parallel Objects (PO; objects moving parallel to the camera), and Presence of Fluids. Note that the sub-categories overlap; see the supplementary.pdf file for more details.

Sample Clips Exhibiting Challenges
Challenges and Clips manifesting them

We report the ATE and RPE-R of each method on each sub-category. Based on the results, the most challenging category is the presence of an object moving alongside the camera, which significantly affects MegaSaM and COLMAP. MonST3R and ORB-SLAM2, on the other hand, struggle most with low-texture scenes. VGGT-Long struggles on a narrower set of challenges than the other methods; it struggles most with low texture and with moving objects, whether independent of or parallel to the camera. We also observe that applying RoMo masking to MegaSaM usually improves its rotation estimates and ATE significantly on the parallel-object and low-light challenges while worsening its results on low-texture scenes, which is in line with our expectations of a motion masking method. Overall, the following table shows that ORBIT exposes a diverse set of challenges and is a valuable tool for analyzing current state-of-the-art methods by highlighting their failure modes.

Challenge Comparisons
Table of method performance (ATE and RPE-R) on the challenge sub-categories.
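For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. It is an illustration under stated assumptions, not the evaluation code behind the table: ATE is taken as the RMSE of position differences between trajectories that are assumed to be already aligned, RPE-R is taken as the mean geodesic angle between ground-truth and estimated relative rotations at a fixed frame offset, and all function names are hypothetical.

```python
import math

def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose3(a):
    """Transpose a 3x3 matrix (the inverse of a rotation)."""
    return [[a[j][i] for j in range(3)] for i in range(3)]

def rotation_angle_deg(r):
    """Geodesic angle of a rotation matrix, in degrees."""
    cos_theta = (r[0][0] + r[1][1] + r[2][2] - 1.0) / 2.0
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))

def rpe_rotation_deg(gt_rots, est_rots, delta=1):
    """Mean rotational relative pose error over frame pairs (i, i + delta)."""
    errors = []
    for i in range(len(gt_rots) - delta):
        rel_gt = matmul3(transpose3(gt_rots[i]), gt_rots[i + delta])
        rel_est = matmul3(transpose3(est_rots[i]), est_rots[i + delta])
        errors.append(rotation_angle_deg(matmul3(transpose3(rel_gt), rel_est)))
    return sum(errors) / len(errors)

def ate_rmse(gt_positions, est_positions):
    """RMSE of position differences; assumes the trajectories are pre-aligned."""
    sq = [sum((g - e) ** 2 for g, e in zip(gp, ep))
          for gp, ep in zip(gt_positions, est_positions)]
    return math.sqrt(sum(sq) / len(sq))

def rot_y(deg):
    """Rotation about the y-axis, used only to build the toy example."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

# Toy trajectories: the estimate over-rotates by 2 degrees per step.
gt_rots = [rot_y(0), rot_y(10), rot_y(20)]
est_rots = [rot_y(0), rot_y(12), rot_y(24)]
rpe_r = rpe_rotation_deg(gt_rots, est_rots)
ate = ate_rmse([(0, 0, 0), (1, 0, 0)], [(0, 0, 0), (1, 0, 1)])
```

In practice the estimated trajectory is usually aligned to the ground truth with a similarity transform before ATE is computed, since monocular SfM recovers poses only up to scale; that alignment step is omitted here for brevity.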