SLVP: Self-Supervised Language-Video Pre-Training for Referring Video Object Segmentation

Jie Mei, AJ Piergiovanni, Jenq-Neng Hwang, Wei Li; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2024, pp. 507-517

Abstract


The referring video object segmentation (R-VOS) task requires a model to understand both the referring expression and the video input. Most recent works are based on an encoder-decoder architecture. Although their text and visual encoders can benefit from separately pre-trained backbones, their decoder is trained from scratch on a combination of image/video segmentation datasets. However, pixel-wise annotation with referring expressions is extremely expensive, which makes it challenging to further improve performance. For the same reason, current vision-language pre-training works mainly focus on learning general feature representations for image-level or object-level tasks, which may not be optimal for the downstream pixel-level segmentation task. To bridge this gap, we present a general self-supervised language-video pre-training (SLVP) architecture. Using relatively cheap video caption datasets, SLVP can learn pixel-level features by introducing optical flow as the intermediate target. Correspondingly, we propose simple transfer learning models that can reuse the pre-trained modules for the downstream R-VOS task. Furthermore, the proposed general SLVP architecture can support either 'language as query' fusion or 'vision as query' fusion. Experiments show the superiority of the under-studied 'vision as query' method, which achieves better performance than the state-of-the-art methods on the Ref-DAVIS17 and Ref-YouTube-VOS benchmarks even with fewer model parameters. We further adapt the challenging VISOR benchmark to the R-VOS task, and our SLVP serves as the first strong baseline for the R-VOS task on it.
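The distinction between 'language as query' and 'vision as query' fusion named in the abstract can be illustrated with generic cross-attention. The sketch below is not the SLVP implementation: it is a minimal single-head scaled dot-product attention with no learned projections, and the function names, feature sizes, and toy values are all assumptions chosen for illustration. It only shows how the choice of query changes the shape of the fused output.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Generic single-head scaled dot-product cross-attention.

    Illustrative only: no learned projections, and the same vectors
    serve as both keys and values.
    """
    dim = len(queries[0])
    outputs = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * kv[d] for w, kv in zip(weights, keys_values))
             for d in range(dim)]
        )
    return outputs


# Hypothetical toy features: 6 visual patches and 3 text tokens, 4 dims each.
vision = [[0.1 * (i + j) for j in range(4)] for i in range(6)]
language = [[0.2 * (i - j) for j in range(4)] for i in range(3)]

# 'Language as query': one fused vector per text token.
language_as_query = cross_attention(language, vision)

# 'Vision as query': one fused vector per visual patch.
vision_as_query = cross_attention(vision, language)

print(len(language_as_query), len(vision_as_query))  # 3 6
```

In this toy setting, the 'vision as query' output has one vector per visual location, while the 'language as query' output has one vector per word. Whether this is the reason for the reported advantage is not stated in the abstract.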

Related Material


[bibtex]
@InProceedings{Mei_2024_WACV,
    author    = {Mei, Jie and Piergiovanni, AJ and Hwang, Jenq-Neng and Li, Wei},
    title     = {SLVP: Self-Supervised Language-Video Pre-Training for Referring Video Object Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {January},
    year      = {2024},
    pages     = {507-517}
}