Hear you are: Teaching LLMs Spatial Reasoning with Vision and Spatial Sound

Hyeonggon Ryu, Joon Son Chung, David Harwath; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 38606-38615

Abstract


Many audio-visual learning methods have focused on aligning audio and visual information, either through semantic or temporal correspondence. However, most of these works have utilized monaural audio, which does not contain information about the spatial location of the sound source. In contrast, humans and other animals utilize binaural hearing to perceive this spatial information. Combining spatial sound and visual perception enables powerful high-level reasoning: for example, a person looking for their phone may hear the ringing sound coming from a backpack sitting on a table, and quickly infer that the missing phone is inside the backpack. In this paper, we investigate the problem of Audio-Visual Spatial Reasoning. We design a spatial audio-visual question answering dataset to cover scenarios where semantic correspondence between audio and visual signals is absent but spatial alignment exists, as well as cases with multiple audio-visual semantic correspondences that require spatial reasoning to disambiguate. We propose a model that learns spatial comprehension across the audio and vision modalities by connecting them with a large language model and experimentally demonstrate that spatial sound perception is an essential part of our task.

Related Material


[bibtex]
@InProceedings{Ryu_2026_CVPR,
    author    = {Ryu, Hyeonggon and Chung, Joon Son and Harwath, David},
    title     = {Hear you are: Teaching LLMs Spatial Reasoning with Vision and Spatial Sound},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {38606-38615}
}