Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training

Kaixuan Lu, Mehmet Onurcan Kaya, Dim P. Papadopoulos; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026, pp. 7387-7397

Abstract


Video Instance Segmentation (VIS) faces significant annotation challenges due to its dual requirements of pixel-level masks and temporal consistency labels. While recent unsupervised methods like VideoCutLER eliminate optical flow dependencies through synthetic data, they remain constrained by the synthetic-to-real domain gap. We present AutoQ-VIS, a novel unsupervised framework that bridges this gap through quality-guided self-training. Our approach establishes a closed-loop system between pseudo-label generation and automatic quality assessment, enabling progressive adaptation from synthetic to real videos. Experiments demonstrate state-of-the-art performance with 52.6 AP50 on the YouTubeVIS-2019 val set, surpassing the previous state-of-the-art VideoCutLER by 4.4%, while requiring no human annotations. This shows the viability of quality-aware self-training for unsupervised VIS. We will release the code at https://github.com/wcbup/AutoQ-VIS.
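The closed loop described in the abstract — pseudo-label real videos, score the labels automatically, and retrain only on the high-quality ones — can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation; all names (quality_guided_self_training, train_step, pseudo_label, quality, tau, rounds) are hypothetical placeholders.

```python
from typing import Any, Callable, Iterable, List, Tuple

def quality_guided_self_training(
    train_step: Callable[[List[Any]], None],   # hypothetical: one training pass over a dataset
    pseudo_label: Callable[[Any], Any],        # hypothetical: current model -> masks for a video
    quality: Callable[[Any, Any], float],      # hypothetical: (video, masks) -> quality score
    synthetic_data: List[Any],
    real_videos: Iterable[Any],
    rounds: int = 3,
    tau: float = 0.5,
) -> None:
    # Bootstrap on synthetic videos (as in VideoCutLER), then adapt to real ones.
    train_step(synthetic_data)
    videos = list(real_videos)
    for _ in range(rounds):
        # Closed loop: generate pseudo-labels, keep only those whose
        # automatically estimated quality exceeds the threshold tau,
        # then retrain on the filtered set.
        kept: List[Tuple[Any, Any]] = []
        for video in videos:
            masks = pseudo_label(video)
            if quality(video, masks) > tau:
                kept.append((video, masks))
        train_step(kept)
```

The key property this sketch tries to capture is that the filter is automatic: quality is estimated without human inspection, which is what keeps the pipeline fully unsupervised.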

Related Material


@InProceedings{Lu_2026_WACV,
  author    = {Lu, Kaixuan and Kaya, Mehmet Onurcan and Papadopoulos, Dim P.},
  title     = {Boosting Unsupervised Video Instance Segmentation with Automatic Quality-Guided Self-Training},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  month     = {March},
  year      = {2026},
  pages     = {7387-7397}
}