Putting the Object Back into Video Object Segmentation

Ho Kei Cheng, Seoung Wug Oh, Brian Price, Joon-Young Lee, Alexander Schwing; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 3151-3161

Abstract


We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent works on VOS employ bottom-up pixel-level memory reading, which struggles due to matching noise, especially in the presence of distractors, resulting in lower performance on more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries. Via those queries, it interacts with the bottom-up pixel features iteratively with a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation. Together with foreground-background masked attention, Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset, Cutie improves by 8.7 J&F over XMem with a similar running time and improves by 4.2 J&F over DeAOT while being three times faster. Code is available at: hkchengrex.github.io/Cutie
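
The idea of foreground-background masked attention can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `masked_cross_attention`, the parameter `num_fg_queries`, and the absence of learned projections and multiple heads are all simplifying assumptions made for illustration. The sketch only shows the general mechanism, in which some object queries read exclusively from pixels marked as foreground and the remaining queries read exclusively from background pixels.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_cross_attention(queries, pixels, fg_mask, num_fg_queries):
    """Hypothetical, simplified masked cross-attention.

    queries: list of query vectors (lists of floats)
    pixels: list of pixel feature vectors, same dimension as the queries
    fg_mask: list of bools, True where a pixel is foreground
    num_fg_queries: the first num_fg_queries queries attend only to
        foreground pixels; the remaining ones only to background pixels
    Returns the queries after one residual attention update.
    """
    updated = []
    for i, q in enumerate(queries):
        want_fg = i < num_fg_queries
        # Indices of the pixels this query is allowed to attend to.
        allowed = [j for j, is_fg in enumerate(fg_mask) if is_fg == want_fg]
        if not allowed:
            # Nothing to read from: leave the query unchanged.
            updated.append(list(q))
            continue
        scale = math.sqrt(len(q))
        scores = [
            sum(a * b for a, b in zip(q, pixels[j])) / scale for j in allowed
        ]
        weights = softmax(scores)
        # Weighted average of the allowed pixel features.
        read = [
            sum(w * pixels[j][d] for w, j in zip(weights, allowed))
            for d in range(len(q))
        ]
        # Residual update of the query with what it read.
        updated.append([a + b for a, b in zip(q, read)])
    return updated


# Toy example: two foreground pixels, two background pixels.
pixels = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
fg_mask = [True, True, False, False]
queries = [[0.0, 0.0], [0.0, 0.0]]
demo = masked_cross_attention(queries, pixels, fg_mask, num_fg_queries=1)
```

In this toy example the first query ends up summarizing only the foreground features and the second only the background features, which is the kind of separation the abstract describes. A real model would apply such an update repeatedly and with learned weights.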

Related Material


[bibtex]
@InProceedings{Cheng_2024_CVPR,
    author    = {Cheng, Ho Kei and Oh, Seoung Wug and Price, Brian and Lee, Joon-Young and Schwing, Alexander},
    title     = {Putting the Object Back into Video Object Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {3151-3161}
}