Exploring Spatial-temporal Instance Relationships In an Intermediate Domain For Image-to-video Object Detection

Zihan Wen, Jin Chen, Xinxiao Wu; Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops, 2022, pp. 354-369

Abstract


Image-to-video object detection leverages annotated images to help detect objects in unannotated videos, breaking the heavy dependency on expensive annotation of large-scale video frames. This task is extremely challenging due to the serious domain discrepancy between images and video frames caused by appearance variation and motion blur. Previous methods perform both image-level and instance-level alignments to reduce the domain discrepancy, but existing false instance alignments may limit their performance in real scenarios. We propose a novel spatial-temporal graph that models the contextual relationships between instances to alleviate false alignments. Through message propagation over the graph, visual information from spatially and temporally neighboring object proposals is adaptively aggregated to enhance the current instance representation. Moreover, to adapt the source-biased decision boundary to the target data, we generate an intermediate domain between images and frames. It is worth mentioning that our method can be easily applied as a plug-and-play component to other image-to-video object detection models based on instance alignment. Experiments on several datasets demonstrate the effectiveness of our method. Code will be available at: https://github.com/wenzihan/STMP.
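The message propagation described above can be illustrated with a minimal sketch: given proposal features and a graph linking each proposal to its spatial and temporal neighbors, neighbor features are aggregated with similarity-based attention and fused into the current instance representation. This is an assumption-laden illustration (the function name, attention form, and residual fusion weight `alpha` are hypothetical choices, not the authors' exact formulation):

```python
import numpy as np

def propagate_messages(node_feats, adjacency, alpha=0.5):
    """One round of message passing over a spatial-temporal proposal graph.

    node_feats: (N, D) array of proposal features (spatial and temporal neighbors).
    adjacency:  (N, N) binary matrix; adjacency[i, j] = 1 links proposal i to j.
    alpha:      hypothetical residual weight balancing original and aggregated features.
    """
    n = node_feats.shape[0]
    # Self-loops guarantee every proposal has at least one neighbor.
    adj = adjacency + np.eye(n)
    # Attention weights from dot-product feature similarity, masked to graph edges.
    sim = node_feats @ node_feats.T
    sim = np.where(adj > 0, sim, -1e9)
    weights = np.exp(sim - sim.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # row-wise softmax over neighbors
    messages = weights @ node_feats                 # aggregate neighbor features
    # Residual fusion keeps the original signal in the enhanced representation.
    return alpha * node_feats + (1 - alpha) * messages
```

In a detector, such a step would sit between proposal feature extraction and the instance-alignment loss, so that aligned features already carry context from related proposals.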

Related Material


[pdf]
[bibtex]
@InProceedings{Wen_2022_ACCV,
    author    = {Wen, Zihan and Chen, Jin and Wu, Xinxiao},
    title     = {Exploring Spatial-temporal Instance Relationships In an Intermediate Domain For Image-to-video Object Detection},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops},
    month     = {December},
    year      = {2022},
    pages     = {354-369}
}