Actor-Centric Tubelets for Real-Time Activity Detection in Extended Videos
We address the problem of detecting human and vehicle activities in long, untrimmed surveillance videos that capture a large field of view. Most existing activity detection approaches are designed for recognizing atomic human actions performed in the foreground, so they are not suitable for detecting activities in extended videos, which contain multiple actors performing co-occurring, complex activities with extreme spatio-temporal scale variation. In this paper, we propose a modular, actor-centric framework for real-time activity detection in extended videos. In particular, we decompose an extended video into a collection of smaller actor-centric tubelets of interest. Each tubelet is a video sub-volume associated with an actor and includes adaptive visual context for recognizing the actor's activities. Once these tubelets are extracted via an object-detection-based approach, we detect activities in each tubelet by focusing on the actor situated in its foreground. To accurately detect the activities of a tubelet's actor, we take into account the interactions with other detected actors and objects within the tubelet. We encode such interactions with a dynamic visual spatio-temporal graph and process it with a Graph Neural Network that yields context-aware actor representations. We validate our activity detection framework on the MEVA (Multiview Extended Video with Activities) dataset and the ActEV 2021 Sequestered Data Leaderboard, and demonstrate its effectiveness in terms of speed and performance.
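To make the tubelet idea concrete, here is a minimal, hypothetical sketch of actor-centric tubelet construction: per-frame detector boxes are greedily linked across time by IoU into actor tracks, and each box is enlarged by an adaptive context margin so that nearby interacting objects fall inside the tubelet's sub-volume. All names (`Tubelet`, `link_tubelets`, `expand_with_context`), the greedy IoU linker, and the fixed-fraction margin are illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch only: names and linking strategy are assumptions,
# not the method described in the abstract.
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Tubelet:
    actor_id: int
    frames: List[int]
    boxes: List[Box]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def expand_with_context(box: Box, margin: float, w: float, h: float) -> Box:
    """Grow an actor box by a fraction of its own size (clipped to the
    frame) so surrounding visual context is kept inside the tubelet."""
    bw, bh = box[2] - box[0], box[3] - box[1]
    return (max(0.0, box[0] - margin * bw), max(0.0, box[1] - margin * bh),
            min(w, box[2] + margin * bw), min(h, box[3] + margin * bh))


def link_tubelets(dets_per_frame: List[List[Box]],
                  iou_thr: float = 0.3) -> List[Tubelet]:
    """Greedily link per-frame detections into tubelets: each active track
    claims the best-overlapping detection in the next frame; leftovers
    start new tracks."""
    tubelets: List[Tubelet] = []
    for t, dets in enumerate(dets_per_frame):
        unmatched = list(dets)
        for tube in tubelets:
            # Only extend tracks that were alive in the previous frame.
            if tube.frames[-1] == t - 1 and unmatched:
                best = max(unmatched, key=lambda d: iou(tube.boxes[-1], d))
                if iou(tube.boxes[-1], best) >= iou_thr:
                    tube.frames.append(t)
                    tube.boxes.append(best)
                    unmatched.remove(best)
        for d in unmatched:
            tubelets.append(Tubelet(len(tubelets), [t], [d]))
    return tubelets
```

Running the linker on a box drifting one pixel per frame yields a single three-frame tubelet; a real system would of course use a learned detector and a more robust tracker, but the decomposition into per-actor sub-volumes with added context is the core idea.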