Mix and Localize: Localizing Sound Sources in Mixtures

Xixi Hu, Ziyang Chen, Andrew Owens; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10483-10492

Abstract


We present a method for simultaneously localizing multiple sound sources within a visual scene. This task requires a model to both group a sound mixture into individual sources and associate them with a visual signal. Our method solves both tasks at once, using a formulation inspired by the contrastive random walk of Jabri et al. We create a graph in which images and separated sounds each correspond to nodes, and train a random walker to transition between nodes from different modalities with high return probability. The transition probabilities for this walk are determined by an audio-visual similarity metric that is learned by our model. We show through experiments with musical instruments and human speech that our model can successfully localize multiple sounds, outperforming other self-supervised methods.
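The training signal described above can be sketched as a cycle-consistency loss: a walker steps from each separated sound to an image region and back, and the model is rewarded when the round-trip probability of returning to the starting node is high. The sketch below is a minimal illustration with random embeddings standing in for the learned audio and visual encoders; the array names, the dot-product similarity, and the temperature value are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable row-wise softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical embeddings: N separated sounds and N image regions, each
# mapped into a shared d-dimensional space (random here, for illustration;
# in the paper these come from learned audio and visual encoders).
rng = np.random.default_rng(0)
N, d = 4, 16
audio_emb = rng.normal(size=(N, d))
visual_emb = rng.normal(size=(N, d))

# Pairwise audio-visual similarity, scaled by an assumed temperature.
tau = 0.1
S = audio_emb @ visual_emb.T / tau

# Walker transition probabilities between the two modalities.
P_av = softmax(S, axis=1)    # each sound transitions to an image region
P_va = softmax(S.T, axis=1)  # each image region transitions back to a sound
P_round = P_av @ P_va        # two-step (round-trip) transition matrix

# Cycle-consistency objective: mass on the diagonal of P_round is the
# probability of returning to the starting node; minimizing this loss
# pushes that return probability toward 1.
loss = -np.log(np.diag(P_round)).mean()
```

In training, minimizing `loss` by gradient descent shapes the similarity metric so that each separated sound and its true visual source become each other's most likely transition targets.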

Related Material


[bibtex]
@InProceedings{Hu_2022_CVPR,
    author    = {Hu, Xixi and Chen, Ziyang and Owens, Andrew},
    title     = {Mix and Localize: Localizing Sound Sources in Mixtures},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {10483-10492}
}