Exploring the Role of Audio in Video Captioning

Yuhan Shen, Linjie Yang, Longyin Wen, Haichao Yu, Ehsan Elhamifar, Heng Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2090-2100

Abstract


Recent focus in video captioning has been on designing architectures that can consume both video and text modalities, and using large-scale video datasets with text transcripts for pre-training, such as HowTo100M. Though these approaches have achieved significant improvement, the audio modality is often ignored in video captioning. In this work, we present an audio-visual framework, which aims to fully exploit the potential of the audio modality for captioning. Instead of relying on text transcripts extracted via automatic speech recognition (ASR), we argue that learning with raw audio signals can be more beneficial, as audio has additional information including acoustic events, speaker identity, etc. Our contributions are twofold. First, we observed that the model overspecializes to the audio modality when pre-training with both video and audio modalities, since the ground truth (i.e., text transcripts) can be solely predicted using audio. We proposed a Modality Balanced Pre-training (MBP) loss to mitigate this issue and significantly improve the performance on downstream tasks. Second, we slice and dice different design choices of the cross-modal module, which may become an information bottleneck and generate inferior results. We proposed new local-global fusion mechanisms to improve information exchange across audio and video. We demonstrate significant improvements by leveraging the audio modality on four datasets, and even outperform the state of the art on some metrics without relying on the text modality as the input.

Related Material


@InProceedings{Shen_2024_CVPR,
    author    = {Shen, Yuhan and Yang, Linjie and Wen, Longyin and Yu, Haichao and Elhamifar, Ehsan and Wang, Heng},
    title     = {Exploring the Role of Audio in Video Captioning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {2090-2100}
}