Going Deeper Into Embedding Learning for Video Object Segmentation

Zongxin Yang, Peike Li, Qianyu Feng, Yunchao Wei, Yi Yang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2019


In this paper, we investigate the principles of consistent training between the given reference and the predicted sequence, aiming at better embedding learning for semi-supervised video object segmentation. To accurately segment the target objects given the mask of the first frame, we observe that the expected feature embeddings of any consecutive frames should satisfy the following properties: 1) global consistency in terms of both foreground object(s) and background; 2) robust local consistency under various object moving rates; 3) environment consistency between the training and inference processes; 4) receptive consistency between the receptive fields of the network and the variable scales of objects; 5) sampling consistency between foreground and background pixels to avoid training bias. With these principles in mind, we carefully design a simple pipeline that improves both accuracy and efficiency for video object segmentation. With ResNet-101 as the backbone, our single model achieves a J&F score of 81.0% on the validation set of the YouTube-VOS benchmark without any bells and whistles. Applying multi-scale and flip augmentation at the testing stage further boosts the accuracy to 82.4%. Code will be made available.
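The flip part of the test-time augmentation mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the names `hflip` and `flip_tta`, the `predict` callable, and the use of nested lists as a single-channel frame are all assumptions made for illustration. A real pipeline would operate on image tensors and would also average over several input scales.

```python
from typing import Callable, List

# Hypothetical representation: an H x W grid of floats (a single-channel
# frame on input, per-pixel foreground probabilities on output).
Grid = List[List[float]]


def hflip(grid: Grid) -> Grid:
    """Mirror a grid horizontally (reverse every row)."""
    return [row[::-1] for row in grid]


def flip_tta(predict: Callable[[Grid], Grid], frame: Grid) -> Grid:
    """Average a model's prediction on a frame and on its mirrored copy.

    `predict` is an assumed stand-in for a segmentation model that maps a
    frame to a probability map of the same size.
    """
    direct = predict(frame)
    # Predict on the mirrored frame, then mirror the result back so that
    # both probability maps are aligned pixel by pixel.
    mirrored = hflip(predict(hflip(frame)))
    return [
        [(a + b) / 2.0 for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(direct, mirrored)
    ]
```

Averaging aligned predictions in this way tends to smooth out left/right asymmetries in a model's output, at the cost of one extra forward pass per frame.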

Related Material

author = {Yang, Zongxin and Li, Peike and Feng, Qianyu and Wei, Yunchao and Yang, Yi},
title = {Going Deeper Into Embedding Learning for Video Object Segmentation},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
month = {Oct},
year = {2019}