Video Object Segmentation via Structural Feature Reconfiguration
Recent memory-based methods have made significant progress in semi-supervised video object segmentation by explicitly modeling the semantic correspondences between the target frame and historical frames. However, these approaches admit historical frames into the memory bank indiscriminately and lack fine-grained feature extraction for target objects, which can incur high latency and information redundancy. In this paper, we address these challenges by developing a Structural Feature Reconfiguration Network (SFRNet). SFRNet consists of two core sub-modules: the Global-temporal Attention Module (GAM) and the Local-spatial Attention Module (LAM). In GAM, we use self-attention-based encoders to capture the target objects' temporal context from historical frames. LAM then reconfigures the features with the current frame's spatial structural prior, which reinforces the objectness of foreground objects and suppresses interference from background regions. As a result, our model reduces its reliance on a large memory bank of redundant historical frames and instead segments video objects effectively using spatio-temporal context aggregated from a small set of key frames. We conduct extensive experiments on benchmark datasets, and the results demonstrate that our method performs favorably against state-of-the-art approaches. The model and code will be made publicly available.
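To make the two-stage idea in the abstract concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes the temporal stage can be approximated by scaled dot-product attention from current-frame features over features pooled from a few key frames, and that the spatial stage can be approximated by scaling each location with a foreground prior in [0, 1]. The function names `temporal_attention` and `spatial_reweight`, the feature shapes, and the toy values are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention(query, memory):
    """Scaled dot-product attention (assumed stand-in for the temporal stage).

    query:  N feature vectors of length D from the current frame
    memory: M feature vectors of length D pooled from a few key frames
    Returns N vectors, each a softmax-weighted average of the memory vectors.
    """
    d = len(query[0])
    out = []
    for q in query:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in memory]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, memory)) for j in range(d)])
    return out


def spatial_reweight(features, prior):
    """Scale each location's feature vector by a foreground prior in [0, 1]
    (assumed stand-in for the spatial stage)."""
    return [[p * f for f in feat] for feat, p in zip(features, prior)]


# Toy data: two current-frame locations, three key-frame memory vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
prior = [1.0, 0.0]  # hypothetical prior: location 0 is foreground, location 1 is background

context = temporal_attention(query, memory)
refined = spatial_reweight(context, prior)
```

In this sketch the memory holds only three vectors, mirroring the abstract's point that context is aggregated from a small set of key frames rather than a large memory bank. The prior then leaves the foreground location unchanged and zeroes out the background location.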