Self-Supervised Segmentation and Source Separation on Videos

Andrew Rouditchenko, Hang Zhao, Chuang Gan, Josh McDermott, Antonio Torralba; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019

Abstract


Semantic segmentation of images [11, 3] and sound source separation in audio [8, 4, 1] are two important and popular tasks in the computer vision and computational audition communities. Traditional approaches have relied on large, labeled datasets, but recent work has leveraged the natural correspondence between vision and sound to learn without explicit labels. In this paper, we develop a neural network model for visual object segmentation and sound source separation that learns from natural videos through self-supervision. The model is an extension of recently proposed work that maps image pixels to sounds [9]. This paper is a workshop version of Rouditchenko et al. 2019 [5].
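
For intuition, below is a minimal PyTorch sketch of the "mix-and-separate" self-supervision used by pixel-to-sound models in this line of work [9]: the audio tracks of two videos are summed, and each video's visual features must predict a spectrogram mask that recovers its own sound from the mixture, so the mixing itself provides the supervision. All module names, layer sizes, and the channel count K are illustrative assumptions, not the architecture from the paper.

# Illustrative mix-and-separate sketch; modules and sizes are assumptions,
# not the authors' architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F

K = 16  # shared audio-visual channel count (assumed)

class FrameNet(nn.Module):
    """Image encoder: maps a video frame to K-channel per-pixel features."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, K, 3, stride=2, padding=1),
        )
    def forward(self, frame):            # (B, 3, H, W)
        return self.conv(frame)          # (B, K, H/4, W/4)

class AudioNet(nn.Module):
    """Audio encoder: maps a mixture spectrogram to K mask channels."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, K, 3, padding=1),
        )
    def forward(self, spec):             # (B, 1, F, T)
        return self.conv(spec)           # (B, K, F, T)

def separate(pixel_feats, audio_feats):
    """Predict a spectrogram mask from visual features (max-pooled here,
    a stand-in for querying a single pixel location)."""
    v = pixel_feats.amax(dim=(2, 3))                 # (B, K)
    mask = torch.einsum('bk,bkft->bft', v, audio_feats)
    return torch.sigmoid(mask).unsqueeze(1)          # (B, 1, F, T)

framenet, audionet = FrameNet(), AudioNet()
frames = torch.randn(2, 4, 3, 224, 224)       # two sources, batch of 4 frames
specs = torch.randn(2, 4, 1, 256, 64).abs()   # their magnitude spectrograms
mix = specs.sum(0)                            # the mixture is the only "label"
audio_feats = audionet(mix)
loss = 0.0
for i in range(2):
    mask = separate(framenet(frames[i]), audio_feats)
    loss = loss + F.l1_loss(mask * mix, specs[i])  # recover each source
loss.backward()

In published models of this kind [9], the encoders are much deeper (e.g., a dilated ResNet for frames and a U-Net for audio), and querying the visual features at individual pixel locations, rather than pooling them, is what yields per-pixel sound masks and hence object segmentations.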

Related Material


@InProceedings{Rouditchenko_2019_CVPR_Workshops,
  author    = {Rouditchenko, Andrew and Zhao, Hang and Gan, Chuang and McDermott, Josh and Torralba, Antonio},
  title     = {Self-Supervised Segmentation and Source Separation on Videos},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2019}
}