Temporal Cue Guided Video Highlight Detection With Low-Rank Audio-Visual Fusion

Qinghao Ye, Xiyue Shen, Yuan Gao, Zirui Wang, Qi Bi, Ping Li, Guang Yang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 7950-7959

Abstract


Video highlight detection plays an increasingly important role in social media content filtering; however, developing automated highlight detection methods remains highly challenging because the temporal annotations needed for supervised learning (i.e., where the highlight moments occur in long videos) are rarely available. In this paper, we propose a novel weakly supervised method that learns to detect highlights by mining video characteristics from video-level annotations (topic tags) only. In particular, we exploit audio-visual features to enrich the video representation and take temporal cues into account to improve detection performance. Our contributions are threefold: 1) we propose an audio-visual tensor fusion mechanism that efficiently models the complex association between the two modalities while narrowing their heterogeneity gap; 2) we introduce a novel hierarchical temporal context encoder that embeds local temporal cues between neighboring segments; 3) we theoretically alleviate the gradient vanishing problem during model optimization with attention-gated instance aggregation. Extensive experiments on two benchmark datasets (YouTube Highlights and TVSum) demonstrate that our method outperforms other state-of-the-art methods by a notable margin.
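To make the first contribution concrete: low-rank tensor fusion approximates a full bilinear interaction between the audio and visual features by a rank-R sum of elementwise products of linearly projected modalities. The sketch below illustrates that general idea in plain Python; it is a minimal illustration of low-rank bilinear fusion, not the authors' exact formulation, and the names `low_rank_fusion`, `Wa`, and `Wv` are assumptions for this example.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def low_rank_fusion(audio, visual, Wa, Wv):
    """Fuse two modality vectors with a rank-R factorization.

    Wa and Wv each hold R projection matrices (one factor pair per rank).
    The rank-R sum of elementwise products approximates a full bilinear
    fusion tensor at a fraction of its parameter count.
    """
    d = len(Wa[0])              # output dimension (rows of each factor matrix)
    fused = [0.0] * d
    for Wa_r, Wv_r in zip(Wa, Wv):
        pa = matvec(Wa_r, audio)     # project audio with the r-th factor
        pv = matvec(Wv_r, visual)    # project visual with the r-th factor
        # Accumulate the elementwise product of the two projections.
        fused = [f + a * v for f, a, v in zip(fused, pa, pv)]
    return fused

# Toy usage: rank 1 with identity factors reduces to an elementwise product.
fused = low_rank_fusion([2, 3], [4, 5], [[[1, 0], [0, 1]]], [[[1, 0], [0, 1]]])
```

With R factor pairs of size d-by-n each, the parameter count grows linearly in R instead of quadratically in the input dimensions, which is what makes this style of fusion efficient.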

Related Material


[bibtex]
@InProceedings{Ye_2021_ICCV,
  author    = {Ye, Qinghao and Shen, Xiyue and Gao, Yuan and Wang, Zirui and Bi, Qi and Li, Ping and Yang, Guang},
  title     = {Temporal Cue Guided Video Highlight Detection With Low-Rank Audio-Visual Fusion},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2021},
  pages     = {7950-7959}
}