Learning Spatiotemporal 3D Convolution with Video Order Self-Supervision

Tomoyuki Suzuki, Takahiro Itazuri, Kensho Hara, Hirokatsu Kataoka; Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0-0

Abstract


The purpose of this work is to explore a self-supervised learning (SSL) strategy for capturing better features with spatiotemporal 3D convolution. Although spatiotemporal 3D CNNs are one of the next frontiers in video recognition, getting 3D convolutions to converge is difficult because of their enormous number of parameters or missing temporal (motion) features. One effective solution is to collect a 10^5-order video database such as Kinetics or Moments in Time. However, this is inefficient because of the burden of manual annotation. In this paper, we train a 3D CNN on wrong video-sequence detection tasks in a self-supervised manner (without any manual annotation). Shuffling and verifying the order of consecutive video frames is effective for a 3D CNN to capture temporal features and to obtain a good starting point for parameters to be fine-tuned. In the experimental section, we verify that our 3D CNN pretrained on wrong clip detection improves performance on UCF101 (+3.99% over the baseline, namely training 3D convolution from scratch).
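
As a rough illustration of how a wrong clip detection sample might be constructed, the following is a minimal sketch and not the authors' implementation. It assumes that several clips are drawn from one video, that exactly one clip has its frame order shuffled, and that the label is the position of that clip. The function name `make_wrong_clip_sample` and the parameters `num_clips` and `clip_len` are hypothetical, and frames are represented only by their indices.

```python
import random


def make_wrong_clip_sample(num_frames, num_clips=4, clip_len=8, rng=None):
    """Return (clips, label): clips of frame indices, one of them shuffled.

    Hypothetical helper for illustration; the sampling scheme is an assumption.
    """
    rng = rng or random.Random()
    if clip_len < 2:
        raise ValueError("clip_len must be at least 2 to allow shuffling")
    if num_frames < clip_len:
        raise ValueError("video is shorter than one clip")

    # Sample clips of consecutive frame indices at random start positions.
    clips = []
    for _ in range(num_clips):
        start = rng.randrange(num_frames - clip_len + 1)
        clips.append(list(range(start, start + clip_len)))

    # Pick one clip and shuffle its frames until the order actually changes.
    label = rng.randrange(num_clips)
    ordered = clips[label]
    shuffled = ordered[:]
    while shuffled == ordered:
        rng.shuffle(shuffled)
    clips[label] = shuffled

    # No manual annotation is needed: the label comes from the shuffling itself.
    return clips, label


clips, label = make_wrong_clip_sample(100, rng=random.Random(0))
```

In a real pipeline the index lists would select frames from a decoded video, and a 3D CNN would be trained to predict `label` from the resulting clips before being fine-tuned on an action recognition dataset.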

Related Material


[pdf]
[bibtex]
@InProceedings{Suzuki_2018_ECCV_Workshops,
author = {Suzuki, Tomoyuki and Itazuri, Takahiro and Hara, Kensho and Kataoka, Hirokatsu},
title = {Learning Spatiotemporal 3D Convolution with Video Order Self-Supervision},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV) Workshops},
month = {September},
year = {2018}
}