On Learning Association of Sound Source and Visual Scenes

Arda Senocak, Tae-Hyun Oh, Junsik Kim, Ming-Hsuan Yang, In So Kweon; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2018, pp. 2508-2509

Abstract


Sight (vision) and hearing (audition) are the primary senses humans use to understand their surroundings. Visual events are typically accompanied by sounds, and the two are naturally combined; likewise, videos and their corresponding sounds arrive together in a synchronized way. Given plenty of paired video and sound clips, can a machine learn, without any supervision, to associate sounds with visual scenes and reveal the locations of sound sources, much as human perception localizes sound sources in visual scenes? In this paper, we are interested in exploring whether computational models can learn the spatial correspondence between visual and audio information by leveraging the correlation between visuals and sound, simply by watching and listening to videos in an unsupervised way.
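The sketch below illustrates one plausible reading of the idea described above: a two-stream model in which a visual encoder keeps a spatial feature map, an audio encoder produces a single embedding, and the correlation between the two yields a sound-source localization map, trained only from matched versus mismatched video/sound pairs. The network shapes, the ranking loss, and all hyperparameters are illustrative assumptions, not details stated in the abstract.

```python
# Minimal sketch of unsupervised audio-visual correspondence (assumed design,
# not the authors' exact architecture). The localization map is the cosine
# similarity between an audio embedding and each spatial visual feature.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualCorrespondence(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        # Toy visual encoder: image -> (D, H, W) spatial feature map.
        self.visual = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Toy audio encoder: raw waveform -> D-dim embedding.
        self.audio = nn.Sequential(
            nn.Conv1d(1, 64, 9, stride=4, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )

    def forward(self, image, sound):
        v = F.normalize(self.visual(image), dim=1)   # (B, D, H, W)
        a = F.normalize(self.audio(sound), dim=1)    # (B, D)
        # Cosine similarity between the audio embedding and every location.
        loc_map = torch.einsum('bdhw,bd->bhw', v, a)  # (B, H, W)
        # Pool the map into one correspondence score per image/sound pair.
        score = loc_map.flatten(1).max(dim=1).values  # (B,)
        return loc_map, score


# Unsupervised training signal (assumed): matched image/sound pairs from the
# same video should score higher than mismatched pairs obtained by shuffling
# the sounds within the batch.
model = AudioVisualCorrespondence()
image = torch.randn(4, 3, 128, 128)
sound = torch.randn(4, 1, 16000)
loc_map, pos = model(image, sound)
_, neg = model(image, sound.roll(1, dims=0))
loss = F.relu(0.5 - pos + neg).mean()   # margin-based ranking loss
loss.backward()
```

At inference time, `loc_map` can be upsampled to the input image size and inspected as a heat map of where the model believes the sound originates, which is the kind of sound-source localization the abstract refers to.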

Related Material


[bibtex]
@InProceedings{Senocak_2018_CVPR_Workshops,
author = {Senocak, Arda and Oh, Tae-Hyun and Kim, Junsik and Yang, Ming-Hsuan and So Kweon, In},
title = {On Learning Association of Sound Source and Visual Scenes},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2018}
}