Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation

Shuting He, Henghui Ding; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13332-13341

Abstract


Referring video segmentation relies on natural language expressions to identify and segment objects, often emphasizing motion clues. Previous works treat a sentence as a whole and directly perform identification at the video level, mixing up static image-level cues with temporal motion cues. However, image-level features cannot well comprehend motion cues in sentences, and static cues are not crucial for temporal perception. In fact, static cues can sometimes interfere with temporal perception by overshadowing motion cues. In this work, we propose to decouple video-level referring expression understanding into static and motion perception, with a specific emphasis on enhancing temporal comprehension. Firstly, we introduce an expression-decoupling module to let static cues and motion cues play their distinct roles, alleviating the issue of sentence embeddings overlooking motion cues. Secondly, we propose a hierarchical motion perception module to capture temporal information effectively across varying timescales. Furthermore, we employ contrastive learning to distinguish the motions of visually similar objects. These contributions yield state-of-the-art performance across five datasets, including a remarkable 9.2% J&F improvement on the challenging MeViS dataset.
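
To make the decoupling idea concrete, below is a minimal PyTorch sketch of how a sentence embedding might be split into static and motion cues, with the motion cue attending to frame features pooled over several timescales, plus an InfoNCE-style loss that separates the motion embeddings of visually similar objects. All module names, dimensions, and the attention layout here are illustrative assumptions, not the authors' released implementation.

    # Illustrative sketch only: module names, dims, and attention layout are
    # assumptions, not the paper's released code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DecoupledPerception(nn.Module):
        """Splits a sentence embedding into static and motion cues; the motion
        cue attends to frame features pooled over several timescales."""
        def __init__(self, dim=256, scales=(1, 2, 4)):
            super().__init__()
            self.to_static = nn.Linear(dim, dim)  # static (appearance) cue head
            self.to_motion = nn.Linear(dim, dim)  # motion (temporal) cue head
            self.scales = scales                  # temporal window sizes
            self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

        def forward(self, sent, frames):
            # sent:   (B, dim)    sentence-level language embedding
            # frames: (B, T, dim) per-frame video features
            static_cue = self.to_static(sent)     # for image-level matching
            motion_cue = self.to_motion(sent)     # for temporal matching

            # Hierarchical motion perception: pool frame features over windows
            # of increasing size so both short- and long-range motion surface.
            pyramids = []
            for s in self.scales:
                pooled = F.avg_pool1d(frames.transpose(1, 2), kernel_size=s,
                                      stride=s, ceil_mode=True).transpose(1, 2)
                pyramids.append(pooled)
            temporal_tokens = torch.cat(pyramids, dim=1)  # (B, sum of T/s, dim)

            # The motion cue queries the multi-scale temporal tokens.
            motion_out, _ = self.attn(motion_cue.unsqueeze(1),
                                      temporal_tokens, temporal_tokens)
            return static_cue, motion_out.squeeze(1)

    def motion_contrastive_loss(anchor, positive, negatives, tau=0.07):
        """InfoNCE-style loss pushing apart the motion embeddings of visually
        similar objects (a stand-in for the paper's contrastive objective)."""
        a = F.normalize(anchor, dim=-1)                            # (B, dim)
        pos = (a * F.normalize(positive, dim=-1)).sum(-1, keepdim=True) / tau
        neg = a @ F.normalize(negatives, dim=-1).t() / tau         # (B, N)
        return F.cross_entropy(torch.cat([pos, neg], dim=-1),
                               torch.zeros(a.size(0), dtype=torch.long))

Pooling windows of increasing size are one simple way to realize "varying timescales": small windows preserve short, fast motions while large windows summarize long-range movement, so the motion cue can match an expression at either granularity.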

Related Material


[pdf] [arXiv]
[bibtex]
@InProceedings{He_2024_CVPR,
    author    = {He, Shuting and Ding, Henghui},
    title     = {Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {13332-13341}
}