Semantic and Sequential Alignment for Referring Video Object Segmentation

Feiyu Pan, Hao Fang, Fangkai Li, Yanyu Xu, Yawei Li, Luca Benini, Xiankai Lu; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 19067-19076

Abstract


Referring video object segmentation (RVOS) seeks to segment the objects in a video referred to by linguistic expressions. Existing RVOS solutions follow a "fuse then select" paradigm: they establish semantic correlations between visual and linguistic features, then perform frame-level query interaction to select the instance mask for each frame with an instance segmentation module. This paradigm overlooks the semantic gap between the linguistic descriptor and the video object, as well as the underlying clutter in the video. This paper proposes a novel Semantic and Sequential Alignment (SSA) paradigm to handle these challenges. We first insert a lightweight adapter after the vision-language model (VLM) to perform semantic alignment. Then, before selecting the mask for each frame, we perform trajectory-to-instance enhancement for each frame via sequential alignment. This paradigm leverages the visual-language alignment inherent in the VLM during adaptation and captures global information by ensembling trajectories. This helps in understanding videos and their corresponding descriptors by mitigating the discrepancy with intricate activity semantics, particularly under occlusion or interference from similar objects. SSA achieves competitive performance while requiring fewer learnable parameters.
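The two alignment steps described above can be illustrated with a minimal numpy sketch. Note this is an assumption-laden illustration, not the paper's implementation: the adapter is rendered as a standard bottleneck module (down-projection, ReLU, zero-initialized up-projection, residual), and the trajectory-to-instance enhancement is rendered as a simple temporal-mean ensembling of per-frame query features; the paper's actual architectures, dimensions, and placement may differ.

```python
import numpy as np

class BottleneckAdapter:
    """Hypothetical lightweight adapter inserted after a frozen VLM:
    down-project -> ReLU -> up-project, with a residual connection."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, (dim, bottleneck))
        self.b_down = np.zeros(bottleneck)
        # Zero-initializing the up-projection makes the adapter an identity
        # map at the start of training, preserving the pretrained
        # visual-language alignment of the VLM features.
        self.w_up = np.zeros((bottleneck, dim))
        self.b_up = np.zeros(dim)

    def __call__(self, x):
        h = np.maximum(x @ self.w_down + self.b_down, 0.0)  # ReLU
        return x + h @ self.w_up + self.b_up                # residual add

def trajectory_enhance(frame_queries):
    """Hypothetical sequential-alignment step: enrich each frame-level
    query with a trajectory-level (temporal-mean) representation, so that
    per-frame mask selection sees global video context."""
    traj = frame_queries.mean(axis=0, keepdims=True)  # (1, dim) trajectory
    return frame_queries + traj                       # broadcast over frames

# Toy usage: 10 token features of dimension 64, bottleneck width 8.
feats = np.random.default_rng(1).normal(size=(10, 64))
adapter = BottleneckAdapter(dim=64, bottleneck=8)
aligned = adapter(feats)

# Toy usage: 5 frames, each with a 64-d instance query.
queries = np.random.default_rng(2).normal(size=(5, 64))
enhanced = trajectory_enhance(queries)
```

The zero-initialized up-projection is a common parameter-efficient-adaptation trick: at initialization the adapter passes features through unchanged, and adaptation proceeds as a small learned perturbation of the frozen VLM representation.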

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Pan_2025_CVPR,
    author    = {Pan, Feiyu and Fang, Hao and Li, Fangkai and Xu, Yanyu and Li, Yawei and Benini, Luca and Lu, Xiankai},
    title     = {Semantic and Sequential Alignment for Referring Video Object Segmentation},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {19067-19076}
}