Multi-Motion and Appearance Self-Supervised Moving Object Detection
In this work, we consider the problem of self-supervised Moving Object Detection (MOD) in video, where no ground truth is used in either the training or the inference phase. Recently, an adversarial learning framework was proposed to leverage inherent temporal information for MOD. While showing promising results, it uses temporal information at a single scale and may struggle with a deformable object whose parts move at different scales. Additional challenges arise from a moving camera, which can cause the motion independence hypothesis to fail and can produce locally independent background motion. To deal with these problems, we propose a Multi-motion and Appearance Self-supervised Network (MASNet) that introduces multi-scale motion information and scene appearance information for MOD. In particular, a moving object, especially a deformable one, usually consists of regions moving at various temporal scales. Introducing multi-scale motion can aggregate these regions to form a more complete detection. Appearance information can serve as another cue for MOD when motion independence is not reliable, and can help remove false detections in the background caused by locally independent background motion. To encode multi-scale motion and appearance, we design a multi-branch flow encoding module and an image inpainter module, respectively, in MASNet. The proposed modules and MASNet are extensively evaluated on the DAVIS dataset to demonstrate their effectiveness and their superiority to state-of-the-art self-supervised methods.