Visual Geometry Grounded Novel-View Acoustic Synthesis

Jay Polra, Dhwanil Chauhan, Wenjun Huang, Kyle Toth, Xianhui Wang, Yang Ni; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 7435-7444

Abstract


We present the first unified framework for novel-view acoustic synthesis that entirely bypasses explicit 3D visual rendering and costly photogrammetry by directly grounding spatial audio generation in feed-forward visual geometry. We show that it synthesizes accurate, immersive spatial audio in 3D spaces without requiring viewpoint images, dense point maps, or ground-truth poses for the input video. Our motivation stems from the observation that existing methods suffer from limited geometry cues, dependence on simulated acoustic environments, inefficient multimodal visual-audio learning, and reliance on costly and unstable photogrammetry pipelines. Our approach addresses these challenges together by blending the learned visual representation and geometry from feed-forward scene encoding, and by jointly conditioning on visual and audio features in geometry-aware binauralization. In particular, we design the Geometry Grounded Acoustic Decoder to dynamically attend to cross-modal features, which embed local and global geometry in the audio and visual modalities. Extensive experiments show that our framework outperforms prior work across various benchmarks in high-quality, viewpoint-accurate spatial audio synthesis, without time-consuming explicit rendering of novel-view images or dense point maps.

Related Material


@InProceedings{Polra_2026_CVPR,
    author    = {Polra, Jay and Chauhan, Dhwanil and Huang, Wenjun and Toth, Kyle and Wang, Xianhui and Ni, Yang},
    title     = {Visual Geometry Grounded Novel-View Acoustic Synthesis},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {7435-7444}
}