Scaling View Synthesis Transformers

Kim, Evan; Ryu, Hyunwoo; Mitchel, Thomas W.; Sitzmann, Vincent

Evan Kim, Hyunwoo Ryu, Thomas W. Mitchel, Vincent Sitzmann; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 28893-28902

Abstract

Geometry-free view synthesis transformers have recently achieved state-of-the-art performance in Novel View Synthesis (NVS), outperforming traditional approaches that rely on explicit geometry modeling. Yet the factors governing their scaling with compute remain unclear. We present a systematic study of scaling laws for view synthesis transformers and derive design principles for training compute-optimal NVS models. Contrary to prior findings, we show that encoder-decoder architectures can be compute-optimal; we trace earlier negative results to suboptimal architectural choices and comparisons across unequal training compute budgets. Across several compute levels, we demonstrate that our encoder-decoder architecture, which we call the Scalable View Synthesis Model (SVSM), scales as effectively as decoder-only models, achieves a superior performance-compute Pareto frontier, and surpasses the previous state-of-the-art on real-world NVS benchmarks with substantially reduced training compute.

Related Material

@InProceedings{Kim_2026_CVPR,
    author    = {Kim, Evan and Ryu, Hyunwoo and Mitchel, Thomas W. and Sitzmann, Vincent},
    title     = {Scaling View Synthesis Transformers},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {28893-28902}
}