SounDiT: Geo-Contextual Soundscape-to-Landscape Generation

Wang, Junbo; Tan, Haofeng; Liao, Bowen; Jiang, Albert; Fei, Teng; Huang, Qixing; Zhou, Bing; Tu, Zhengzhong; Ye, Shan; Kang, Yuhao

Junbo Wang, Haofeng Tan, Bowen Liao, Albert Jiang, Teng Fei, Qixing Huang, Bing Zhou, Zhengzhong Tu, Shan Ye, Yuhao Kang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 32659-32670

Abstract

Recent audio-to-image models have shown impressive performance in generating images of specific objects conditioned on their corresponding sounds. However, these models fail to reconstruct real-world landscapes conditioned on acoustic environments. To address this challenge, we present Geo-contextual Soundscape-to-Landscape (GeoS2L) generation, a novel and practically significant task that aims to synthesize geographically realistic landscape images from environmental soundscapes. To support this task, we construct two large-scale geo-contextual multi-modal datasets, SoundingSVI and SonicUrban, which pair diverse environmental soundscapes with real-world landscape images. We further propose SounDiT, a diffusion transformer (DiT)-based model that incorporates acoustic environments and geo-contextual scene conditioning to synthesize geographically coherent landscape images. Furthermore, we propose the Place Similarity Score (PSS), a practically informed geo-contextual evaluation framework that measures consistency between input soundscapes and generated landscape images. Extensive experiments demonstrate that SounDiT outperforms existing baselines on the GeoS2L task, while the PSS effectively captures multi-level generation consistency across element, scene, and human perception.

Related Material

@InProceedings{Wang_2026_CVPR,
    author    = {Wang, Junbo and Tan, Haofeng and Liao, Bowen and Jiang, Albert and Fei, Teng and Huang, Qixing and Zhou, Bing and Tu, Zhengzhong and Ye, Shan and Kang, Yuhao},
    title     = {SounDiT: Geo-Contextual Soundscape-to-Landscape Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {32659-32670}
}