Video Representation Learning Through Prediction for Online Object Detection

Masato Fujitake, Akihiro Sugimoto; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2022, pp. 530-539

Abstract


We present a video representation learning framework for real-time video object detection. Our approach is based on the observation that strong prior knowledge of video context helps improve object recognition, and that such knowledge can be acquired by learning video representations through stochastic video prediction. Our framework incorporates stochastic video prediction into object detection: we first acquire prior knowledge of videos in the form of video representations, and then adapt these representations to object detection to improve accuracy. We validate our method on the ImageNet VID and VisDrone-VID2019 datasets to demonstrate the effectiveness of video representation learning via future video prediction. In particular, our extensive experiments on ImageNet VID show that our approach achieves 73.1% mAP at 54 fps with ResNet-50 on commercial GPUs.
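To illustrate the two-stage recipe the abstract describes, here is a minimal sketch (not the authors' code) of pretraining a shared backbone on a future-frame prediction pretext task and then reusing it for detection. All module names, shapes, and losses are illustrative assumptions; in particular, the paper uses stochastic video prediction, while this sketch uses a simple deterministic predictor for brevity.

import torch
import torch.nn as nn

class VideoBackbone(nn.Module):
    """Shared feature extractor (stand-in for the paper's ResNet-50-based backbone)."""
    def __init__(self, channels=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, frame):          # frame: (B, 3, H, W)
        return self.encoder(frame)     # features: (B, C, H/4, W/4)

class FramePredictor(nn.Module):
    """Stage 1: predict the next frame from current features (self-supervised pretext task)."""
    def __init__(self, backbone, channels=64):
        super().__init__()
        self.backbone = backbone
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(channels, 3, 4, stride=2, padding=1),
        )

    def forward(self, frame_t):
        return self.decoder(self.backbone(frame_t))   # predicted frame t+1

class DetectionHead(nn.Module):
    """Stage 2: a toy dense detection head on top of the pretrained backbone."""
    def __init__(self, channels=64, num_classes=30):
        super().__init__()
        self.cls = nn.Conv2d(channels, num_classes, 1)
        self.box = nn.Conv2d(channels, 4, 1)

    def forward(self, feats):
        return self.cls(feats), self.box(feats)

# --- Stage 1: representation learning via future-frame prediction ---
backbone = VideoBackbone()
predictor = FramePredictor(backbone)
opt = torch.optim.Adam(predictor.parameters(), lr=1e-4)
frames = torch.rand(2, 2, 3, 64, 64)                  # dummy clip: (B, T=2, C, H, W)
pred_next = predictor(frames[:, 0])
loss_pred = nn.functional.mse_loss(pred_next, frames[:, 1])
loss_pred.backward(); opt.step(); opt.zero_grad()

# --- Stage 2: adapt the pretrained backbone to object detection ---
head = DetectionHead()
opt_det = torch.optim.Adam(list(backbone.parameters()) + list(head.parameters()), lr=1e-4)
feats = backbone(frames[:, 1])
cls_logits, boxes = head(feats)                       # supervised detection losses would follow

The design point is simply that the backbone is shared: the prediction stage shapes its representations with video context, and the detection stage fine-tunes those same weights rather than training from scratch.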

Related Material


[bibtex]
@InProceedings{Fujitake_2022_WACV,
    author    = {Fujitake, Masato and Sugimoto, Akihiro},
    title     = {Video Representation Learning Through Prediction for Online Object Detection},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {January},
    year      = {2022},
    pages     = {530-539}
}