E2E-LOAD: End-to-End Long-form Online Action Detection

Shuqiang Cao, Weixin Luo, Bairui Wang, Wei Zhang, Lin Ma; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 10422-10432


Recently, feature-based methods for Online Action Detection (OAD) have been gaining traction. However, these methods are constrained by their fixed backbone design, which fails to leverage the potential benefits of a trainable backbone. This paper introduces an end-to-end learning network that revises these approaches, incorporating a backbone network design that improves effectiveness and efficiency. Our proposed model utilizes a shared initial spatial model for all frames and maintains an extended sequence cache, which enables low-cost inference. We promote an asymmetric spatiotemporal model that caters to long-form and short-form modeling. Additionally, we propose an innovative and efficient inference mechanism that accelerates extensive spatiotemporal exploration. Through comprehensive ablation studies and experiments, we validate the performance and efficiency of our proposed method. Remarkably, we achieve an end-to-end learning OAD of 17.3 (+12.6) FPS with 72.4% (+1.2%), 90.3% (+0.7%), and 48.1% (+26.0%) mAP on THMOUS'14, TVSeries, and HDD, respectively.

Related Material

[pdf] [supp]
@InProceedings{Cao_2023_ICCV, author = {Cao, Shuqiang and Luo, Weixin and Wang, Bairui and Zhang, Wei and Ma, Lin}, title = {E2E-LOAD: End-to-End Long-form Online Action Detection}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2023}, pages = {10422-10432} }