M^3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation
Abstract
Intelligent robots need to interact with diverse objects across various environments. The appearance and state of objects frequently undergo complex transformations depending on the object's properties, e.g., phase transitions. However, in the vision community, segmenting dynamic objects undergoing phase transitions has been overlooked. In light of this, we introduce the concept of phase in segmentation, which categorizes real-world objects based on their visual characteristics and their potential changes in morphology and appearance. We then present a new benchmark, Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation (M^3-VOS), to verify the ability of models to understand object phases. It consists of 471 high-resolution videos spanning more than 10 distinct everyday scenarios and provides dense instance mask annotations that capture both object phases and their transitions. We evaluate state-of-the-art methods on M^3-VOS and draw several key insights. Notably, current appearance-based approaches show significant room for improvement when handling objects with phase transitions. The inherent change in disorder during such transitions suggests that predictions from the forward, entropy-increasing process can be improved by a reverse, entropy-reducing process. These findings lead us to propose ReVOS, a new plug-and-play model that improves segmentation performance through reversal refinement. Our data and code will be publicly available at https://zixuan-chen.github.io/M-cube-VOS.github.io/.
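To make the abstract's forward/reverse intuition concrete, below is a minimal, hypothetical Python sketch of one way to read "reversal refinement": propagate the annotated first-frame mask forward with any off-the-shelf semi-supervised VOS model, re-run the same model on the time-reversed video seeded with the last forward prediction, and fuse the two passes. The names (segment_with_reversal_refinement, vos_model, fuse) and the simple averaging fusion are assumptions for illustration only and do not reproduce the paper's actual ReVOS architecture.

import numpy as np

def fuse(forward_probs, backward_probs, alpha=0.5):
    # Blend per-frame soft masks from the forward and reverse passes.
    return [alpha * f + (1.0 - alpha) * b
            for f, b in zip(forward_probs, backward_probs)]

def segment_with_reversal_refinement(frames, first_mask, vos_model):
    # Illustrative sketch (not the paper's ReVOS): vos_model(frames, seed_mask)
    # stands in for any semi-supervised VOS model that propagates seed_mask
    # (given on frames[0]) through frames and returns per-frame soft masks in [0, 1].

    # Forward pass: propagate the annotated first-frame mask through the video.
    forward_probs = vos_model(frames, first_mask)

    # Reverse pass: re-run the same model on the time-reversed video,
    # seeded with the forward pass's prediction for the last frame.
    backward_probs = vos_model(frames[::-1], forward_probs[-1] > 0.5)
    backward_probs = backward_probs[::-1]  # restore temporal order

    # Fuse both passes and threshold into binary masks.
    fused = fuse(forward_probs, backward_probs)
    return [p > 0.5 for p in fused]

# Tiny stub so the sketch runs end to end: a "propagation" that just repeats
# the seed mask for every frame (replace with a real VOS model).
def dummy_vos_model(frames, seed_mask):
    return [seed_mask.astype(float) for _ in frames]

frames = [np.zeros((4, 4, 3)) for _ in range(5)]
first_mask = np.zeros((4, 4), dtype=bool)
first_mask[1:3, 1:3] = True
masks = segment_with_reversal_refinement(frames, first_mask, dummy_vos_model)
print(len(masks), masks[0].sum())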
Related Material
[pdf]
[supp]
[bibtex]
@InProceedings{Chen_2025_CVPR,
    author    = {Chen, Zixuan and Li, Jiaxin and Liang, Junxuan and Tan, Liming and Guo, Yejie and Lu, Cewu and Li, Yong-Lu},
    title     = {M{\textasciicircum}3-VOS: Multi-Phase, Multi-Transition, and Multi-Scenery Video Object Segmentation},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {29193-29202}
}