AViON4D: Audio-Visual Open-Vocabulary 4D Egocentric Scene Understanding

Ballester, Irene; Hermosilla, Pedro; Lin, Wei; Glass, James R.; Mirza, M. Jehanzeb; Kampel, Martin

Irene Ballester, Pedro Hermosilla, Wei Lin, James R. Glass, M. Jehanzeb Mirza, Martin Kampel; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 8355-8365

Abstract

Egocentric videos offer rich, first-person perspectives on human interactions with the world, enabling applications such as activity recognition and assistive technologies. While recent advances in neural radiance fields (NeRFs) and semantic distillation from vision-language models have enabled open-vocabulary 3D scene understanding, these methods are limited to static scenes and cannot encode the temporal semantics and dynamic human-world interactions essential for egocentric spatio-temporal understanding. We introduce AViON4D, the first multimodal NeRF-based method that extends semantic NeRFs from 3D to 4D for egocentric scenes. AViON4D lifts video features into 4D representations, replacing static image embeddings with temporally aware features that encode action dynamics. To handle the extreme motion characteristic of egocentric video, we design an object-centric feature extraction strategy that maintains spatial and temporal coherence by tracking objects across frames. Furthermore, AViON4D incorporates audio-language features as a complementary temporal signal via late fusion, helping to disambiguate visually similar actions and refine temporal boundaries through distinctive acoustic cues. Extensive experiments on two complementary 4D open-vocabulary tasks demonstrate that AViON4D outperforms existing single-modal and static approaches by up to +6.0% for action localization and +14.90% for action segmentation. Moreover, for the first time, we show that audio cues from egocentric recordings not only enhance performance but also narrow the gap between single-frame and video-based models, offering a more robust and generalizable solution for 4D scene understanding in egocentric settings. Our code is available at https://github.com/iballester/AViON4D.
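
The late-fusion idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that per-frame similarity scores between a text query and the visual and audio features have already been computed, and the function names (`min_max`, `late_fusion`, `segments`), the mixing weight `alpha`, and the threshold are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple


def min_max(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def late_fusion(visual: List[float], audio: List[float],
                alpha: float = 0.5) -> List[float]:
    """Weighted sum of independently normalized per-frame scores.

    alpha is a hypothetical mixing weight (1.0 = visual only, 0.0 = audio only).
    """
    if len(visual) != len(audio):
        raise ValueError("visual and audio scores must cover the same frames")
    v, a = min_max(visual), min_max(audio)
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(v, a)]


def segments(scores: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame ranges where score >= threshold."""
    out: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i                      # a segment opens here
        elif s < threshold and start is not None:
            out.append((start, i - 1))     # the segment closed on the previous frame
            start = None
    if start is not None:
        out.append((start, len(scores) - 1))
    return out


if __name__ == "__main__":
    # Toy scores (invented for illustration): the visual signal is broad,
    # while the audio signal peaks sharply on the frames where the action sounds.
    visual = [0.1, 0.4, 0.5, 0.5, 0.4, 0.1]
    audio = [0.0, 0.0, 0.9, 0.8, 0.0, 0.0]
    print(segments(min_max(visual), threshold=0.7))
    print(segments(late_fusion(visual, audio), threshold=0.7))
```

With these toy values, thresholding the visual scores alone selects frames 1 to 4, whereas the fused scores select only frames 2 to 3, which shows how a sharp acoustic cue could tighten a temporal boundary. How AViON4D actually normalizes and weights the two modalities is not stated in the abstract.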

Related Material


@InProceedings{Ballester_2026_CVPR,
    author    = {Ballester, Irene and Hermosilla, Pedro and Lin, Wei and Glass, James R. and Mirza, M. Jehanzeb and Kampel, Martin},
    title     = {AViON4D: Audio-Visual Open-Vocabulary 4D Egocentric Scene Understanding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {8355-8365}
}