Unsupervised Sounding Object Localization With Bottom-Up and Top-Down Attention

Jiayin Shi, Chao Ma; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022, pp. 1737-1746

Abstract


Learning to localize sounding objects in visual scenes without manual annotations has drawn increasing attention recently. In this paper, we propose an unsupervised sounding object localization algorithm that uses bottom-up and top-down attention in visual scenes. The bottom-up attention module generates an objectness confidence map, while the top-down attention module computes the similarity between the sound and visual regions. Moreover, we propose a bottom-up attention loss function, which models the correlation between bottom-up and top-down attention. Extensive experimental results demonstrate that our proposed method significantly outperforms state-of-the-art unsupervised methods. The source code is available at https://github.com/VISION-SJTU/usol/.
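The abstract describes fusing a bottom-up objectness confidence map with a top-down audio-visual similarity map. A minimal NumPy sketch of one plausible combination is shown below; the function name, the cosine-similarity formulation, and the sigmoid-gated fusion are all illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def sounding_object_map(visual_feats, audio_embed, objectness):
    """Hypothetical sketch of combining bottom-up and top-down attention.

    visual_feats: (C, H, W) visual feature map
    audio_embed:  (C,) audio embedding
    objectness:   (H, W) bottom-up objectness confidence map
    Returns the top-down similarity map and the fused localization map.
    """
    C, H, W = visual_feats.shape
    # L2-normalize visual features at each spatial location
    v = visual_feats.reshape(C, -1)
    v = v / (np.linalg.norm(v, axis=0, keepdims=True) + 1e-8)
    # L2-normalize the audio embedding
    a = audio_embed / (np.linalg.norm(audio_embed) + 1e-8)
    # Top-down attention: cosine similarity between sound and each region
    topdown = (a @ v).reshape(H, W)
    # Fuse with bottom-up objectness (sigmoid gating is an assumption)
    combined = (1.0 / (1.0 + np.exp(-topdown))) * objectness
    return topdown, combined
```

In this sketch, the top-down map highlights regions that correlate with the sound, while the bottom-up map suppresses responses outside object-like regions.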

Related Material


[pdf]
[bibtex]
@InProceedings{Shi_2022_WACV,
    author    = {Shi, Jiayin and Ma, Chao},
    title     = {Unsupervised Sounding Object Localization With Bottom-Up and Top-Down Attention},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2022},
    pages     = {1737-1746}
}