Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers

Xin Hu, Kai Li, Deep Patel, Erik Kruus, Martin Renqiang Min, Zhengming Ding; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2704-2713

Abstract


Weakly Supervised Temporal Action Localization (WSTAL) aims to jointly localize and classify action segments in untrimmed videos with only video level annotations. To leverage video level annotations most existing methods adopt the multiple instance learning paradigm where frame/snippet level action predictions are first produced and then aggregated to form a video-level prediction. Although there are trials to improve snippet-level predictions by modeling temporal relationships we argue that those implementations have not sufficiently exploited such information. In this paper we propose Multi Modal Plateau Transformers (M2PT) for WSTAL by simultaneously exploiting temporal relationships among snippets complementary information across data modalities and temporal coherence among consecutive snippets. Specifically M2PT explores a dual Transformer architecture for RGB and optical flow modalities which models intra modality temporal relationship with a self attention mechanism and inter modality temporal relationship with a cross attention mechanism. To capture the temporal coherence that consecutive snippets are supposed to be assigned with the same action M2PT deploys a Plateau model to refine the temporal localization of action segments. Experimental results on popular benchmarks demonstrate that our proposed M2PT achieves state of the art performance.

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Hu_2024_CVPR, author = {Hu, Xin and Li, Kai and Patel, Deep and Kruus, Erik and Min, Martin Renqiang and Ding, Zhengming}, title = {Weakly-Supervised Temporal Action Localization with Multi-Modal Plateau Transformers}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops}, month = {June}, year = {2024}, pages = {2704-2713} }