Watch Only Once: An End-to-End Video Action Detection Framework

Shoufa Chen, Peize Sun, Enze Xie, Chongjian Ge, Jiannan Wu, Lan Ma, Jiajun Shen, Ping Luo; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8178-8187

Abstract


We propose an end-to-end pipeline, named Watch Once Only (WOO), for video action detection. Current methods either decouple video action detection task into separated stages of actor localization and action classification or train two separated models within one stage. In contrast, our approach solves the actor localization and action classification simultaneously in a unified network. The whole pipeline is significantly simplified by unifying the backbone network and eliminating many hand-crafted components. WOO takes a unified video backbone to simultaneously extract features for actor location and action classification. In addition, we introduce spatial-temporal action embeddings into our framework and design a spatial-temporal fusion module to obtain more discriminative features with richer information, which further boosts the action classification performance. Extensive experiments on AVA and JHMDB datasets show that WOO achieves state-of-the-art performance, while still reduces up to 16.7% GFLOPs compared with existing methods. We hope our work can inspire rethinking the convention of action detection and serve as a solid baseline for end-to-end action detection. Code is available.

Related Material


[pdf]
[bibtex]
@InProceedings{Chen_2021_ICCV, author = {Chen, Shoufa and Sun, Peize and Xie, Enze and Ge, Chongjian and Wu, Jiannan and Ma, Lan and Shen, Jiajun and Luo, Ping}, title = {Watch Only Once: An End-to-End Video Action Detection Framework}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2021}, pages = {8178-8187} }