ECO: Efficient Convolutional Network for Online Video Understanding

Mohammadreza Zolfaghari, Kamaljeet Singh, Thomas Brox; Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 695-712

Abstract


The state of the art in video understanding suffers from two problems: (1) The major part of reasoning is performed locally in the video, thus missing important relationships within actions that span several seconds. (2) While there are local methods with fast per-frame processing, the processing of the whole video is not efficient and hampers fast video retrieval or online classification of long-term activities. In this paper, we introduce a network architecture that takes long-term content into account and enables fast per-video processing at the same time. The architecture is based on merging long-term content already in the network rather than in a post-hoc fusion. Together with a sampling strategy, which exploits that neighboring frames are largely redundant, this yields high-quality action classification and video captioning at up to 230 videos per second, where each video can consist of a few hundred frames. The approach achieves competitive performance across all datasets while being 10x to 80x faster than state-of-the-art methods.
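To make the two ideas in the abstract concrete, below is a minimal sketch (not the authors' code) of the ECO-Lite style design: one frame is sampled per temporal segment, a shared 2D head processes each sampled frame, the per-frame feature maps are stacked into a spatio-temporal volume, and a small 3D network merges long-term content inside the network before classification. The layer sizes, module names (EcoLiteSketch, head2d, net3d), and frame count are illustrative placeholders, not the BN-Inception / 3D-ResNet stages used in the paper.

# Minimal sketch of the ECO-Lite idea, assuming a simplified 2D/3D stack.
import torch
import torch.nn as nn

class EcoLiteSketch(nn.Module):
    def __init__(self, num_frames=16, num_classes=400):
        super().__init__()
        self.num_frames = num_frames
        # 2D head applied to every sampled frame with shared weights.
        self.head2d = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(64, 96, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(96), nn.ReLU(inplace=True),
        )
        # 3D network over the stacked per-frame feature maps: this is where
        # long-term temporal context is merged inside the network rather
        # than in a post-hoc fusion step.
        self.net3d = nn.Sequential(
            nn.Conv3d(96, 128, kernel_size=3, padding=1),
            nn.BatchNorm3d(128), nn.ReLU(inplace=True),
            nn.Conv3d(128, 256, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm3d(256), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(256, num_classes)

    def forward(self, frames):
        # frames: (batch, num_frames, 3, H, W), one frame sampled per segment,
        # exploiting that neighboring frames are largely redundant.
        b, n, c, h, w = frames.shape
        feats = self.head2d(frames.reshape(b * n, c, h, w))     # (b*n, 96, h', w')
        _, ch, fh, fw = feats.shape
        # Stack per-frame maps into a spatio-temporal volume: (b, 96, n, h', w')
        volume = feats.reshape(b, n, ch, fh, fw).permute(0, 2, 1, 3, 4)
        pooled = self.net3d(volume).flatten(1)                  # (b, 256)
        return self.fc(pooled)

if __name__ == "__main__":
    model = EcoLiteSketch(num_frames=16, num_classes=400)
    clip = torch.randn(2, 16, 3, 224, 224)   # 2 videos, 16 sampled frames each
    print(model(clip).shape)                 # torch.Size([2, 400])

Because the 2D head runs on only a fixed, small number of sampled frames regardless of video length, per-video cost stays roughly constant, which is what enables the high video-per-second throughput reported in the abstract.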

Related Material


[pdf] [arXiv]
[bibtex]
@InProceedings{Zolfaghari_2018_ECCV,
author = {Zolfaghari, Mohammadreza and Singh, Kamaljeet and Brox, Thomas},
title = {ECO: Efficient Convolutional Network for Online Video Understanding},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
month = {September},
year = {2018}
}