@InProceedings{Zhang_2026_CVPR,
    author    = {Zhang, Wenkang and Yang, Kaicheng and An, Xiang and Li, Qiang and Feng, Ziyong and Yang, Wankou and Deng, Jiankang},
    title     = {Towards Streaming Referring Video Segmentation via Large Language Model},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {24598-24607}
}
Towards Streaming Referring Video Segmentation via Large Language Model
Abstract
Current referring video segmentation methods typically operate offline: sparse frames are first selected for image-level referring segmentation, and the resulting masks are then propagated across the video. Although video sampling captures global context, its isolated processing steps both complicate optimization and restrict applicability to real-world streaming scenarios. In this paper, we propose StreamingRVOS, a simple but efficient MLLM-based framework that extends image-level segmentation to the video level via a streaming pipeline without introducing extra parameters. Specifically, we employ a Semantic Embedding Recycling (SER) method to propagate temporal context across frames, enabling the model to perceive semantic representations in the video. We then propose an Online Mask Consistency Perception (OMCP) strategy that adaptively invokes the MLLM to re-perceive the current scene and regenerate the semantic embedding. We conduct extensive experiments on multiple downstream datasets to demonstrate the effectiveness of StreamingRVOS. Compared with previous methods, our method achieves excellent performance in referring video segmentation (the 1B variant improves upon Sa2VA by 19.2 on the MeViS dataset) while running at an average speed of 7 FPS under streaming inference on a single A800 GPU.
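The streaming control flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how SER stores the embedding or how OMCP measures mask consistency, so the sketch assumes a single cached embedding and an IoU test between consecutive masks. The names `perceive`, `decode`, `mask_iou`, `stream_segment`, and `iou_threshold` are hypothetical stand-ins for the MLLM call, the mask decoder, and the consistency criterion.

```python
from typing import Any, Callable, List, Sequence, Tuple

Mask = List[List[int]]  # binary mask as nested lists (illustrative only)


def mask_iou(a: Mask, b: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = union = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            inter += 1 if (pa and pb) else 0
            union += 1 if (pa or pb) else 0
    # Two empty masks are treated as perfectly consistent.
    return 1.0 if union == 0 else inter / union


def stream_segment(
    frames: Sequence[Any],
    perceive: Callable[[Any], Any],      # hypothetical: expensive MLLM call
    decode: Callable[[Any, Any], Mask],  # hypothetical: cheap mask decoder
    iou_threshold: float = 0.5,          # assumed consistency criterion
) -> Tuple[List[Mask], int]:
    """Process frames one by one, recycling the semantic embedding.

    The MLLM stand-in runs on the first frame and again only when the new
    mask disagrees with the previous one (assumed here to mean low IoU).
    Returns the per-frame masks and the number of MLLM invocations.
    """
    embedding = None
    prev_mask = None
    masks: List[Mask] = []
    mllm_calls = 0
    for frame in frames:
        if embedding is None:
            embedding = perceive(frame)
            mllm_calls += 1
        # Recycle the cached embedding for the current frame.
        mask = decode(frame, embedding)
        if prev_mask is not None and mask_iou(prev_mask, mask) < iou_threshold:
            # Inconsistent mask: re-perceive the scene and decode again.
            embedding = perceive(frame)
            mllm_calls += 1
            mask = decode(frame, embedding)
        masks.append(mask)
        prev_mask = mask
    return masks, mllm_calls
```

In this sketch the expensive call runs only when the consistency check fails, which is one plausible reading of how adaptive invocation could keep streaming inference fast; the actual trigger used by OMCP may differ.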

