MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation

Roy Miles, Mehmet Kerim Yucel, Bruno Manganelli, Albert Saà-Garriga; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023, pp. 10480-10490

Abstract


This paper tackles the problem of semi-supervised video object segmentation on resource-constrained devices, such as mobile phones. We formulate this problem as a distillation task and demonstrate that small space-time-memory networks with finite memory can achieve results competitive with the state of the art at a fraction of the computational cost (32 milliseconds per frame on a Samsung Galaxy S22). Specifically, we provide a theoretically grounded framework that unifies knowledge distillation with supervised contrastive representation learning. These models benefit jointly from both pixel-wise contrastive learning and distillation from a pre-trained teacher. We validate this loss by achieving J&F scores competitive with the state of the art on both the standard DAVIS and YouTube benchmarks, despite running up to 5x faster and with 32x fewer parameters.
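To make the idea of combining pixel-wise contrastive learning with teacher distillation concrete, the following is a minimal NumPy sketch. It is not the loss derived in the paper: the supervised contrastive term is the generic textbook form, the distillation term is assumed to be a plain mean-squared error between student and teacher embeddings, and the names `supervised_contrastive_loss`, `distillation_loss`, `combined_loss`, `temperature`, and `weight` are all hypothetical.

```python
import numpy as np

def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """Generic supervised contrastive loss (an illustrative assumption).

    embeddings: (N, D) array, one feature vector per sampled pixel.
    labels:     (N,) array of object ids; pixels sharing an id are positives.
    """
    # L2-normalise so that dot products are cosine similarities.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    not_self = ~np.eye(len(labels), dtype=bool)

    # Log-softmax over all *other* pixels (the anchor itself is excluded).
    logits = logits - logits.max(axis=1, keepdims=True)
    denom = (np.exp(logits) * not_self).sum(axis=1, keepdims=True)
    log_prob = logits - np.log(denom)

    # Average the log-probability over each anchor's positives.
    positives = (labels[:, None] == labels[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0  # skip anchors that have no positive pair
    per_anchor = -(log_prob * positives).sum(axis=1)[valid] / pos_count[valid]
    return per_anchor.mean()

def distillation_loss(student, teacher):
    """Assumed distillation term: MSE between student and teacher embeddings."""
    return np.mean((student - teacher) ** 2)

def combined_loss(student, teacher, labels, weight=1.0):
    """Sum of the two terms; `weight` is a hypothetical balancing factor."""
    return (supervised_contrastive_loss(student, labels)
            + weight * distillation_loss(student, teacher))

# Four pixel embeddings forming two tight clusters, one per object.
student = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
teacher = student + 0.05  # stand-in for a pre-trained teacher's output
labels = np.array([0, 0, 1, 1])
loss = combined_loss(student, teacher, labels)
```

In this sketch the contrastive term pulls together pixels of the same object and the distillation term pulls the student towards the teacher. The paper's unified formulation should be taken from the PDF rather than from this example.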

Related Material


@InProceedings{Miles_2023_CVPR,
    author    = {Miles, Roy and Yucel, Mehmet Kerim and Manganelli, Bruno and Sa\`a-Garriga, Albert},
    title     = {MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2023},
    pages     = {10480-10490}
}