Joint Visual and Audio Learning for Video Highlight Detection

Taivanbat Badamdorj, Mrigank Rochan, Yang Wang, Li Cheng; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8127-8137

Abstract

In video highlight detection, the goal is to identify the interesting moments within an unedited video. Although the audio component of the video provides important cues for highlight detection, the majority of existing efforts focus almost exclusively on the visual component. In this paper, we argue that both audio and visual components of a video should be modeled jointly to retrieve its best moments. To this end, we propose an audio-visual network for video highlight detection. At the core of our approach lies a bimodal attention mechanism, which captures the interaction between the audio and visual components of a video, and produces fused representations to facilitate highlight detection. Furthermore, we introduce a noise sentinel technique to adaptively discount a noisy visual or audio modality. Empirical evaluations on two benchmark datasets demonstrate the superior performance of our approach over the state-of-the-art methods.
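
As a rough illustration of the idea described above, the following is a minimal PyTorch sketch of cross-modal attention with a learnable noise sentinel. The module name (BimodalAttention), the parameter names, and all dimensions are assumptions chosen for exposition, not the paper's exact architecture; it shows one plausible way a sentinel key/value slot lets attention discount a noisy modality.

import torch
import torch.nn as nn
import torch.nn.functional as F

class BimodalAttention(nn.Module):
    """Illustrative cross-modal attention with a learnable noise sentinel.

    Features of one modality (queries) attend over features of the other
    (context). A sentinel key/value pair is appended to the context so the
    softmax can route attention mass away from a noisy modality. This is a
    sketch under assumed names and dimensions, not the authors' exact model.
    """

    def __init__(self, dim: int = 128):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        # Learnable sentinel: one extra key/value slot per attention pass.
        self.sentinel_key = nn.Parameter(torch.randn(1, 1, dim))
        self.sentinel_value = nn.Parameter(torch.zeros(1, 1, dim))
        self.scale = dim ** -0.5

    def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        # queries: (B, Tq, D) clip features of one modality
        # context: (B, Tc, D) clip features of the other modality
        B = queries.size(0)
        q = self.query(queries)                       # (B, Tq, D)
        k = self.key(context)                         # (B, Tc, D)
        v = self.value(context)                       # (B, Tc, D)
        # Append the sentinel so attention can "opt out" of the context.
        k = torch.cat([k, self.sentinel_key.expand(B, -1, -1)], dim=1)
        v = torch.cat([v, self.sentinel_value.expand(B, -1, -1)], dim=1)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)  # (B, Tq, Tc+1)
        return queries + attn @ v                     # residual fusion

# Toy usage: fuse per-clip visual and audio features before scoring highlights.
if __name__ == "__main__":
    fuse = BimodalAttention(dim=128)
    visual = torch.randn(2, 20, 128)    # 2 videos, 20 clips, 128-d visual features
    audio = torch.randn(2, 20, 128)     # matching 128-d audio features
    fused_visual = fuse(visual, audio)  # visual clips attend over audio (+ sentinel)
    print(fused_visual.shape)           # torch.Size([2, 20, 128])

The sentinel gives the softmax an explicit fallback slot: when the context modality carries little useful signal, attention mass can shift to the sentinel, and the residual connection lets the fused output lean back toward the query modality.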

Related Material

[pdf]
[bibtex]
@InProceedings{Badamdorj_2021_ICCV,
  author    = {Badamdorj, Taivanbat and Rochan, Mrigank and Wang, Yang and Cheng, Li},
  title     = {Joint Visual and Audio Learning for Video Highlight Detection},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2021},
  pages     = {8127-8137}
}