Embracing Uncertainty: Decoupling and De-Bias for Robust Temporal Grounding

Zhou, Hao; Zhang, Chongyang; Luo, Yan; Chen, Yanjun; Hu, Chuanping

Hao Zhou, Chongyang Zhang, Yan Luo, Yanjun Chen, Chuanping Hu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 8445-8454

Abstract

Temporal grounding aims to localize temporal boundaries within untrimmed videos from language queries, but it faces two types of inevitable human uncertainty: query uncertainty and label uncertainty. Both stem from human subjectivity and limit the generalization ability of temporal grounding. In this work, we propose a novel model, DeNet (Decoupling and De-bias), to embrace human uncertainty. Decoupling: we explicitly disentangle each query into a relation feature and a modified feature. The relation feature, which is mainly based on skeleton-like words (including nouns and verbs), aims to extract basic and consistent information in the presence of query uncertainty. Meanwhile, the modified feature, which is assigned style-like words (including adjectives, adverbs, etc.), represents subjective information and thus yields personalized predictions. De-bias: we propose a de-bias mechanism that generates diverse predictions, aiming to alleviate the bias caused by single-style annotations in the presence of label uncertainty. Moreover, we put forward new multi-label metrics to diversify the performance evaluation. Extensive experiments show that our approach is more effective and robust than state-of-the-art methods on the Charades-STA and ActivityNet Captions datasets.
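
The decoupling idea in the abstract, separating skeleton-like words from style-like words, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `split_query`, the use of Universal POS tag names, and the choice to ignore words outside the four listed categories are all assumptions made for illustration, and the query is assumed to be POS-tagged already.

```python
from typing import List, Tuple

# Assumption: Universal POS tag names. The abstract only says "nouns and verbs"
# for skeleton-like words and "adjectives, adverbs, etc." for style-like words.
SKELETON_TAGS = {"NOUN", "VERB"}
STYLE_TAGS = {"ADJ", "ADV"}


def split_query(tagged: List[Tuple[str, str]]) -> Tuple[List[str], List[str]]:
    """Partition (word, pos_tag) pairs into skeleton-like and style-like words.

    Hypothetical helper: words with other tags (e.g. determiners) are dropped
    here, which is a simplification and not something the abstract specifies.
    """
    skeleton = [word for word, tag in tagged if tag in SKELETON_TAGS]
    style = [word for word, tag in tagged if tag in STYLE_TAGS]
    return skeleton, style


# Hand-tagged example query; a real pipeline would obtain tags from a POS tagger.
query = [
    ("person", "NOUN"),
    ("quickly", "ADV"),
    ("opens", "VERB"),
    ("the", "DET"),
    ("red", "ADJ"),
    ("door", "NOUN"),
]
skeleton, style = split_query(query)
print(skeleton)  # ['person', 'opens', 'door']
print(style)     # ['quickly', 'red']
```

In the paper's terms, the first list would feed the relation feature and the second the modified feature; how those features are actually computed is not described in the abstract.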

Related Material

@InProceedings{Zhou_2021_CVPR,
    author    = {Zhou, Hao and Zhang, Chongyang and Luo, Yan and Chen, Yanjun and Hu, Chuanping},
    title     = {Embracing Uncertainty: Decoupling and De-Bias for Robust Temporal Grounding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {8445-8454}
}