Weakly Supervised Temporal Action Localization Through Contrast Based Evaluation Networks

Ziyi Liu, Le Wang, Qilin Zhang, Zhanning Gao, Zhenxing Niu, Nanning Zheng, Gang Hua; The IEEE International Conference on Computer Vision (ICCV), 2019, pp. 3899-3908


Weakly-supervised temporal action localization (WS-TAL) is a promising but challenging task with only video-level action categorical labels available during training. Without requiring temporal action boundary annotations in training data, WS-TAL could possibly exploit automatically retrieved video tags as video-level labels. However, such coarse video-level supervision inevitably incurs confusions, especially in untrimmed videos containing multiple action instances. To address this challenge, we propose the Contrast-based Localization EvaluAtioN Network (CleanNet) with our new action proposal evaluator, which provides pseudo-supervision by leveraging the temporal contrast in snippet-level action classification predictions. Essentially, the new action proposal evaluator enforces an additional temporal contrast constraint so that high-evaluation-score action proposals are more likely to coincide with true action instances. Moreover, the new action localization module is an integral part of CleanNet which enables end-to-end training. This is in contrast to many existing WS-TAL methods where action localization is merely a post-processing step. Experiments on THUMOS14 and ActivityNet datasets validate the efficacy of CleanNet against existing state-ofthe- art WS-TAL algorithms.

Related Material

[pdf] [supp]
author = {Liu, Ziyi and Wang, Le and Zhang, Qilin and Gao, Zhanning and Niu, Zhenxing and Zheng, Nanning and Hua, Gang},
title = {Weakly Supervised Temporal Action Localization Through Contrast Based Evaluation Networks},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {October},
year = {2019}