Audio-Visual Segmentation via Unlabeled Frame Exploitation

Jinxiang Liu, Yikun Liu, Fei Zhang, Chen Ju, Ya Zhang, Yanfeng Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26328-26339

Abstract


Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. Although great progress has been witnessed, we experimentally reveal that current methods obtain only marginal performance gains from the use of the unlabeled frames, leading to an underutilization issue. To fully explore the potential of the unlabeled frames for AVS, we explicitly divide them into two categories based on their temporal characteristics, i.e., the neighboring frame (NF) and the distant frame (DF). NFs, temporally adjacent to the labeled frame, often contain rich motion information that assists in the accurate localization of sounding objects. In contrast to NFs, DFs have long temporal distances from the labeled frame and share semantically similar objects with appearance variations. Considering their unique characteristics, we propose a versatile framework that effectively leverages them to tackle AVS. Specifically, for NFs, we exploit the motion cues as dynamic guidance to improve objectness localization. Besides, we exploit the semantic cues in DFs by treating them as valid augmentations of the labeled frames, which are then used to enrich data diversity in a self-training manner. Extensive experimental results demonstrate the versatility and superiority of our method, unleashing the power of the abundant unlabeled frames.
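
At a high level, the frame exploitation described in the abstract can be pictured as first partitioning a clip's unlabeled frames by their temporal distance to the annotated frame, then routing NFs to a motion-guidance branch and DFs to a pseudo-labeled pool for self-training. The sketch below only illustrates that partitioning and self-training idea; it is not the paper's implementation, and the window size `nf_window`, the `segment` model interface, and the confidence filter are assumptions for illustration.

```python
# Illustrative sketch only: partition unlabeled frames into neighboring
# frames (NFs) and distant frames (DFs) relative to the labeled frame,
# then build a pseudo-labeled pool from DFs for self-training.
# `nf_window`, `segment`, and `min_confidence` are hypothetical.

from dataclasses import dataclass, field


@dataclass
class Clip:
    labeled_idx: int                 # index of the annotated frame
    num_frames: int                  # total frames in the clip
    nfs: list = field(default_factory=list)   # neighboring frames
    dfs: list = field(default_factory=list)   # distant frames


def split_unlabeled_frames(clip: Clip, nf_window: int = 2) -> Clip:
    """Assign every unlabeled frame to NF or DF by temporal distance."""
    for t in range(clip.num_frames):
        if t == clip.labeled_idx:
            continue
        if abs(t - clip.labeled_idx) <= nf_window:
            clip.nfs.append(t)       # close in time: rich motion cues
        else:
            clip.dfs.append(t)       # far in time: appearance variation
    return clip


def build_self_training_pool(clip, segment, min_confidence=0.9):
    """Pseudo-label DFs with the current model and keep confident ones.

    `segment(frame_idx)` is a stand-in for running an AVS model; it is
    assumed to return (mask, confidence) for the sounding object.
    """
    pool = []
    for t in clip.dfs:
        mask, conf = segment(t)
        if conf >= min_confidence:   # confident DFs act as extra labels
            pool.append((t, mask))
    return pool


if __name__ == "__main__":
    clip = split_unlabeled_frames(Clip(labeled_idx=3, num_frames=10))
    print("NFs:", clip.nfs)          # [1, 2, 4, 5]
    print("DFs:", clip.dfs)          # [0, 6, 7, 8, 9]
```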

Related Material


@InProceedings{Liu_2024_CVPR,
    author    = {Liu, Jinxiang and Liu, Yikun and Zhang, Fei and Ju, Chen and Zhang, Ya and Wang, Yanfeng},
    title     = {Audio-Visual Segmentation via Unlabeled Frame Exploitation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {26328-26339}
}