UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling

@InProceedings{Tan_2026_CVPR,
    author    = {Tan, Kaiyuan and Shen, Yingying and Zhu, Ziyue and Tu, Mingfei and Zhu, Haohui and Sun, Haiyang and Wang, Bing and Chen, Guang and Ye, Hangjun},
    title     = {UFO: Unifying Feed-Forward and Optimization-based Methods for Large Driving Scene Modeling},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {21849-21859}
}
Abstract
Dynamic driving scene modeling is critical for autonomous driving simulation and closed-loop learning. Recent feed-forward methods offer fast inference through data-driven priors, but they struggle with long-range driving sequences because their cost grows quadratically with sequence length and they make restrictive assumptions about dynamic objects. We propose UFO, a novel recurrent paradigm that combines the strengths of optimization-based and feed-forward methods for efficient long-range 4D modeling. Our approach maintains a persistent 4D scene representation composed of scene tokens that are progressively refined as new frames arrive, so that later observations can correct earlier, uncertain predictions. A visibility-based filtering mechanism exploits the locality of 3D-to-pixel correspondence, reducing complexity from quadratic to near-linear in sequence length. For dynamic objects, we introduce object pose-guided modeling with learned lifespans, enabling complex long-range motion without kinematic assumptions. Experiments on the Waymo Open Dataset demonstrate that UFO significantly outperforms both per-scene optimization and feed-forward baselines across 2 s, 8 s, and zero-shot 16 s sequences.
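The abstract gives no implementation details. Still, the two gating ideas it names, visibility-based filtering and lifespans for dynamic-object tokens, can be illustrated with a minimal sketch. Under stated assumptions, it selects which persistent scene tokens should take part in processing a given frame: a token is kept only if it projects in front of the camera and inside the image, and, if it belongs to a dynamic object, only if its lifespan contains the frame's timestamp. The names `SceneToken`, `PinholeCamera`, and `visible_token_ids` are hypothetical. So are the pinhole projection model and the representation of a lifespan as a fixed `(t_start, t_end)` interval. This is not UFO's actual design; in the paper, lifespans are learned.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class SceneToken:
    """Hypothetical token record: a world-space center plus an optional lifespan."""
    center: Vec3
    # None means a static token that is always alive; otherwise (t_start, t_end)
    # is the interval during which a dynamic-object token exists (assumed form).
    lifespan: Optional[Tuple[float, float]] = None


@dataclass
class PinholeCamera:
    """Assumed pinhole camera with a world-to-camera rigid transform."""
    fx: float
    fy: float
    cx: float
    cy: float
    width: int
    height: int
    rotation: Tuple[Vec3, Vec3, Vec3]  # row-major 3x3, world -> camera
    translation: Vec3

    def project(self, p: Vec3) -> Optional[Tuple[float, float]]:
        """Return pixel (u, v) of world point p, or None if it lies behind the camera."""
        x, y, z = (
            sum(self.rotation[i][j] * p[j] for j in range(3)) + self.translation[i]
            for i in range(3)
        )
        if z <= 0.0:
            return None
        return self.fx * x / z + self.cx, self.fy * y / z + self.cy


def visible_token_ids(tokens: Sequence[SceneToken], camera: PinholeCamera, t: float) -> List[int]:
    """Indices of tokens that are alive at time t and project inside the image."""
    selected = []
    for i, tok in enumerate(tokens):
        # Lifespan gate: skip dynamic-object tokens that do not exist at time t.
        if tok.lifespan is not None:
            t_start, t_end = tok.lifespan
            if not (t_start <= t <= t_end):
                continue
        # Visibility gate: keep only tokens in front of the camera and inside the frame.
        uv = camera.project(tok.center)
        if uv is None:
            continue
        u, v = uv
        if 0.0 <= u < camera.width and 0.0 <= v < camera.height:
            selected.append(i)
    return selected


# Usage: an identity-pose camera looking down +z.
cam = PinholeCamera(
    fx=100.0, fy=100.0, cx=64.0, cy=48.0, width=128, height=96,
    rotation=((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)),
    translation=(0.0, 0.0, 0.0),
)
tokens = [
    SceneToken((0.0, 0.0, 5.0)),                       # static, in view
    SceneToken((0.0, 0.0, -5.0)),                      # behind the camera
    SceneToken((10.0, 0.0, 5.0)),                      # projects to u = 264, off-image
    SceneToken((0.0, 0.0, 5.0), lifespan=(0.0, 1.0)),  # dynamic, alive only for t in [0, 1]
]
print(visible_token_ids(tokens, cam, t=0.5))  # [0, 3]
print(visible_token_ids(tokens, cam, t=2.0))  # [0]
```

In a recurrent model of the kind the abstract describes, only the selected subset would presumably take part in the expensive token-to-pixel interaction for that frame. This naive version still scans every token per frame to apply the test, so it illustrates the selection criterion only. It does not show how the paper reaches its stated near-linear cost; keeping the selection itself cheap as the scene grows would likely require something like a spatial index.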

