Weakly-Supervised Action Localization With Background Modeling

Phuc Xuan Nguyen, Deva Ramanan, Charless C. Fowlkes; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5502-5511


We describe a latent approach that learns to detect actions in long sequences given training videos with only whole-video class labels. Our approach makes use of two innovations to attention-modeling in weakly-supervised learning. First, and most notably, our framework uses an attention model to extract both foreground and background frames who's appearance is explicitly modeled. Most prior work ignores the background, but we show that modeling it allows our system to learn a richer notions of actions and their temporal extents. Second, we combine bottom-up, class-agnostic attention modules with top-down, class-specific activation maps, using the latter as form of self-supervision for the former. Doing so allows our model to learn a more accurate model of attention without explicit temporal supervision. These modifications lead to 10% AP@IoU=0.5 improvement over existing systems on THUMOS14. Our proposed weakly-supervised system outperforms the recent state-of-the-art by at least 4.3% AP@IoU=0.5. Finally, we demonstrate that weakly-supervised learning can be used to aggressively scale-up learning to in-the-wild, uncurated Instagram videos (where relevant frames and videos are automatically selected through attentional processing). This allows our weakly supervised approach to even outperform fully-supervised methods for action detection at some overlap thresholds.

Related Material

[pdf] [supp]
author = {Nguyen, Phuc Xuan and Ramanan, Deva and Fowlkes, Charless C.},
title = {Weakly-Supervised Action Localization With Background Modeling},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2019}