UVIS: Unsupervised Video Instance Segmentation

Shuaiyi Huang, Saksham Suri, Kamal Gupta, Sai Saketh Rambhatla, Ser-Nam Lim, Abhinav Shrivastava; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2682-2692

Abstract


Video instance segmentation (VIS) requires classifying, segmenting, and tracking every object across video frames. Unlike existing approaches that rely on masks, boxes, or category labels, we propose UVIS, a novel unsupervised video instance segmentation framework that performs VIS without any video annotations or dense label-based pretraining. Our key insight is to leverage the dense shape prior of the self-supervised vision foundation model DINO and the open-set recognition ability of the image-caption-supervised vision-language model CLIP. The UVIS framework consists of three essential steps: frame-level pseudo-label generation, transformer-based VIS model training, and query-based tracking. To improve the quality of VIS predictions in the unsupervised setup, we introduce a dual-memory design: a semantic memory bank for generating accurate pseudo-labels and a tracking memory bank for maintaining temporal consistency in object tracks. We evaluate our approach on three standard VIS benchmarks, namely YouTubeVIS-2019, YouTubeVIS-2021, and Occluded VIS. UVIS achieves 21.1 AP on YouTubeVIS-2019 without any video annotations or dense pretraining, demonstrating the potential of our unsupervised VIS framework.
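As a rough illustration of the first step, the Python sketch below clusters per-patch DINO features into candidate instance masks and assigns each mask an open-vocabulary label by cosine similarity against CLIP text embeddings. All names here (generate_pseudo_labels, clip_mask_embed, and so on) are hypothetical stand-ins rather than the paper's actual code, and the random features in the usage example only make the snippet self-contained; this is a minimal sketch of the idea, not the UVIS implementation.

```python
# Minimal sketch of frame-level pseudo-label generation, assuming access to
# precomputed DINO patch features and a CLIP image/text encoder. Function and
# variable names are illustrative, not the paper's actual API.
import numpy as np
from sklearn.cluster import KMeans

def generate_pseudo_labels(patch_feats, clip_mask_embed, clip_text_embed, n_instances=5):
    """Cluster DINO patch features into candidate masks, then label each
    mask by cosine similarity between its CLIP embedding and text prompts.

    patch_feats:      (H*W, D_dino) per-patch DINO features for one frame
    clip_mask_embed:  callable mapping a boolean mask (H*W,) -> (D_clip,) embedding
    clip_text_embed:  (C, D_clip) unit-norm CLIP text embeddings for C prompts
    """
    # Dense shape prior from DINO: patches of the same object tend to cluster.
    cluster_ids = KMeans(n_clusters=n_instances, n_init=10).fit_predict(patch_feats)

    pseudo_labels = []
    for k in range(n_instances):
        mask = cluster_ids == k                 # candidate instance mask
        emb = clip_mask_embed(mask)             # CLIP embedding of the masked region
        emb = emb / np.linalg.norm(emb)
        sims = clip_text_embed @ emb            # cosine similarity to each prompt
        pseudo_labels.append((mask, int(np.argmax(sims)), float(np.max(sims))))
    return pseudo_labels  # list of (mask, class_id, confidence)

# Toy usage with random stand-ins for real DINO/CLIP outputs.
rng = np.random.default_rng(0)
feats = rng.normal(size=(14 * 14, 384))         # fake 14x14 patch grid of DINO features
texts = rng.normal(size=(10, 512))
texts /= np.linalg.norm(texts, axis=1, keepdims=True)
fake_clip = lambda m: rng.normal(size=512)      # placeholder for a real CLIP image encoder
print(generate_pseudo_labels(feats, fake_clip, texts)[0][1])
```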

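Similarly, the following minimal sketch shows one common way a tracking memory bank can maintain temporal consistency in query-based tracking: current-frame query embeddings are matched to stored track embeddings with Hungarian matching on cosine similarity, and matched entries are refreshed with a momentum average. The class name, matching cost, and update rule are assumptions for illustration; the exact UVIS design may differ.

```python
# Illustrative query-based tracking with a memory bank: frame queries are
# matched to stored track embeddings via the Hungarian algorithm, and matched
# entries are updated with a momentum average. Names and update rule are
# assumptions, not the paper's exact method.
import numpy as np
from scipy.optimize import linear_sum_assignment

class TrackingMemoryBank:
    def __init__(self, momentum=0.8):
        self.momentum = momentum
        self.tracks = None  # (T, D): one embedding per active track

    def assign(self, queries):
        """Match (Q, D) frame query embeddings to tracks; returns a track id per query."""
        q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
        if self.tracks is None:                  # first frame: every query starts a track
            self.tracks = q
            return np.arange(len(q))
        t = self.tracks / np.linalg.norm(self.tracks, axis=1, keepdims=True)
        cost = -q @ t.T                          # negative cosine similarity
        rows, cols = linear_sum_assignment(cost) # Hungarian matching
        ids = np.full(len(q), -1)                # -1 marks unmatched queries
        for r, c in zip(rows, cols):
            ids[r] = c
            # Momentum update keeps track embeddings temporally consistent.
            self.tracks[c] = self.momentum * self.tracks[c] + (1 - self.momentum) * q[r]
        return ids

# Toy usage: two frames of random query embeddings.
rng = np.random.default_rng(1)
bank = TrackingMemoryBank()
print(bank.assign(rng.normal(size=(3, 64))))    # frame 1 -> [0 1 2]
print(bank.assign(rng.normal(size=(3, 64))))    # frame 2 -> matched track ids
```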
Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Huang_2024_CVPR,
  author    = {Huang, Shuaiyi and Suri, Saksham and Gupta, Kamal and Rambhatla, Sai Saketh and Lim, Ser-Nam and Shrivastava, Abhinav},
  title     = {UVIS: Unsupervised Video Instance Segmentation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {2682-2692}
}