OpenCity3D: What do Vision-Language Models Know About Urban Environments?

Valentin Bieri, Marco Zamboni, Nicolas Samuel Blumer, Qingxuan Chen, Francis Engelmann; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 5147-5155

Abstract


The rise of 2D vision-language models (VLMs) has enabled new possibilities for language-driven 3D scene understanding tasks. Existing works focus on indoor scenes or autonomous driving scenarios and typically validate against a pre-defined set of semantic object classes. In this work, we analyze the capabilities of vision-language models for large-scale urban 3D scene understanding and propose new applications of VLMs that operate directly on aerial 3D reconstructions of cities. In particular, we address higher-level 3D scene understanding tasks such as population density, building age, property prices, crime rate, and noise pollution. Our analysis reveals surprising zero-shot and few-shot performance of VLMs in urban environments.
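The abstract only outlines the approach at a high level. As a rough illustration of how zero-shot scoring of an urban attribute with a CLIP-style VLM could look, the sketch below compares a rendered aerial view against contrasting text prompts. The open_clip library, the specific model checkpoint, the prompt wording, and the file name are all assumptions for illustration; this is not the authors' implementation.

```python
# Minimal, hypothetical sketch: zero-shot "building age" scoring of one
# rendered aerial view with a CLIP-style VLM (open_clip). Not the paper's code.
import torch
import open_clip
from PIL import Image

# Illustrative model choice; any CLIP-style checkpoint would follow the same API.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

# Hypothetical prompts describing opposite ends of the attribute of interest.
prompts = ["a photo of an old historic building",
           "a photo of a newly constructed building"]

# Hypothetical rendered view of the aerial 3D reconstruction.
image = preprocess(Image.open("aerial_view.png")).unsqueeze(0)
text = tokenizer(prompts)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Softmax over image-text similarities gives a relative score per prompt.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # e.g. [[0.7, 0.3]] -> the view resembles older buildings
```

Aggregating such per-view scores back onto the 3D reconstruction (e.g. per building or per region) would be one plausible way to produce the map-level predictions the abstract describes.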

Related Material


[bibtex]
@InProceedings{Bieri_2025_WACV,
    author    = {Bieri, Valentin and Zamboni, Marco and Blumer, Nicolas Samuel and Chen, Qingxuan and Engelmann, Francis},
    title     = {OpenCity3D: What do Vision-Language Models Know About Urban Environments?},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {5147-5155}
}