Action Detection via an Image Diffusion Process

Lin Geng Foo, Tianjiao Li, Hossein Rahmani, Jun Liu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 18351-18361

Abstract


Action detection aims to localize the starting and ending points of action instances in untrimmed videos and to predict the classes of those instances. In this paper, we observe that the outputs of the action detection task can be formulated as images. Thus, from a novel perspective, we tackle action detection as a three-image generation process, producing starting point, ending point, and action-class predictions as images with our proposed Action Detection Image Diffusion (ADI-Diff) framework. Furthermore, since our images differ from natural images and exhibit special properties, we explore a Discrete Action-Detection Diffusion Process and a Row-Column Transformer design to better handle their processing. Our ADI-Diff framework achieves state-of-the-art results on two widely-used datasets.

Related Material


[bibtex]
@InProceedings{Foo_2024_CVPR,
    author    = {Foo, Lin Geng and Li, Tianjiao and Rahmani, Hossein and Liu, Jun},
    title     = {Action Detection via an Image Diffusion Process},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {18351-18361}
}