Spatio-Temporal Activity Detection via Joint Optimization of Spatial and Temporal Localization

Md Atiqur Rahman, Robert Laganière; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2024, pp. 242-250

Abstract


In this article, we address the problem of spatio-temporal activity detection, which requires both classifying human activities and localizing them in space and time in videos. To this end, we propose a novel single-stage, end-to-end trainable deep learning framework that jointly optimizes the spatial and temporal localization of activities. Leveraging shared spatio-temporal feature maps, the proposed framework performs actor detection, activity tube building, and temporal localization of activities, all within a single network. The proposed framework outperforms the current state-of-the-art methods in spatio-temporal activity detection on the challenging UCF101-24 benchmark. Using RGB input alone, it achieves a video-mAP of 60.1%, which rises to 61.3% when both RGB and FLOW inputs are used. Moreover, it attains a highly competitive frame-mAP of 74.9%.
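
For readers unfamiliar with the video-mAP figure quoted in the abstract: a predicted activity tube is typically counted as correct when its spatio-temporal overlap with a ground-truth tube of the same class exceeds a threshold. The following is a minimal sketch of that overlap, assuming the commonly used definition (temporal IoU multiplied by the mean spatial IoU over shared frames). The names `box_iou` and `tube_iou` and the dictionary-based tube representation are illustrative assumptions, not taken from the paper, and the paper's exact evaluation protocol may differ.

```python
from typing import Dict, Tuple

# Assumed representations (hypothetical, for illustration only):
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Tube = Dict[int, Box]                    # frame index -> box in that frame


def box_iou(a: Box, b: Box) -> float:
    """Spatial intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_iou(pred: Tube, gt: Tube) -> float:
    """Spatio-temporal IoU: temporal IoU times mean spatial IoU on shared frames."""
    shared = set(pred) & set(gt)
    if not shared:
        return 0.0
    # Temporal IoU computed on sets of frame indices.
    temporal_iou = len(shared) / len(set(pred) | set(gt))
    # Spatial IoU averaged only over frames present in both tubes.
    mean_spatial = sum(box_iou(pred[f], gt[f]) for f in shared) / len(shared)
    return temporal_iou * mean_spatial
```

Under this definition, a tube with perfectly placed boxes that covers only part of the ground-truth duration is still penalized through the temporal term, which is why video-mAP rewards accurate temporal localization as well as spatial localization.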

Related Material


[bibtex]
@InProceedings{Rahman_2024_WACV,
    author    = {Rahman, Md Atiqur and Lagani\`ere, Robert},
    title     = {Spatio-Temporal Activity Detection via Joint Optimization of Spatial and Temporal Localization},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {January},
    year      = {2024},
    pages     = {242-250}
}