End-to-End Spatio-Temporal Action Localisation with Video Transformers

Alexey A. Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lucic, Cordelia Schmid, Anurag Arnab; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 18373-18383

Abstract


The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks. We propose a fully end-to-end, transformer-based model that directly ingests an input video and outputs tubelets: a sequence of bounding boxes and the action classes at each frame. Our flexible model can be trained with either sparse bounding-box supervision on individual frames or full tubelet annotations, and in both cases it predicts coherent tubelets as the output. Moreover, our end-to-end model requires no additional pre-processing in the form of proposals, nor post-processing in terms of non-maximal suppression. We perform extensive ablation experiments and significantly advance the state-of-the-art on five different spatio-temporal action localisation benchmarks with both sparse keyframe and full tubelet annotations.
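
To make the output format described above concrete, here is a minimal sketch of what a predicted tubelet could look like as a data structure: one bounding box plus per-class action scores for every frame of the clip. This is purely illustrative and not the authors' code; all names, shapes, and values are hypothetical.

```python
# Illustrative sketch of a "tubelet": one bounding box and a set of
# action-class scores per frame, with boxes temporally linked across frames.
# All names, shapes, and values are hypothetical, not the authors' code.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TubeletFrame:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in normalised image coordinates
    class_scores: List[float]               # one score per action class for this frame


@dataclass
class Tubelet:
    frames: List[TubeletFrame]  # one entry per input frame of the clip


# A toy two-frame tubelet for a three-class problem (values are made up).
example = Tubelet(frames=[
    TubeletFrame(box=(0.10, 0.20, 0.45, 0.90), class_scores=[0.05, 0.90, 0.05]),
    TubeletFrame(box=(0.12, 0.21, 0.47, 0.91), class_scores=[0.04, 0.92, 0.04]),
])
print(len(example.frames))  # 2 frames in this tubelet
```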

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Gritsenko_2024_CVPR,
    author    = {Gritsenko, Alexey A. and Xiong, Xuehan and Djolonga, Josip and Dehghani, Mostafa and Sun, Chen and Lucic, Mario and Schmid, Cordelia and Arnab, Anurag},
    title     = {End-to-End Spatio-Temporal Action Localisation with Video Transformers},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {18373-18383}
}