Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description

Chiori Hori, Takaaki Hori, Gordon Wichern, Jue Wang, Teng-Yok Lee, Anoop Cherian, Tim K. Marks; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2018, pp. 2528-2531

Abstract


We incorporate audio features, in addition to image and motion features, for video description based on encoder-decoder recurrent neural networks (RNNs). To fuse these modalities, we introduce a multimodal attention model that can selectively utilize features from different modalities for each word in the output description. We apply our new framework to video description using state-of-the-art audio features such as SoundNet and AudioSet VGGish, and state-of-the-art image and spatiotemporal features such as I3D. Results confirm that our attention-based multimodal fusion of audio features with visual features outperforms conventional video description approaches on three datasets.
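The core idea of the multimodal attention described above can be sketched as follows: at each decoding step, each modality's context vector is scored against the current decoder state, the scores are normalized with a softmax, and the contexts are fused by a weighted sum. This is a minimal NumPy illustration assuming additive (Bahdanau-style) scoring and features already projected to a common dimension; all names, dimensions, and parameter shapes are illustrative, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D_S, D_C = 8, 16   # decoder-state and projected-feature dims (illustrative)
N_MOD = 3          # e.g. image, motion, and audio streams

def multimodal_attention(s, contexts, W_s, W_c, w_v):
    """Fuse per-modality context vectors with attention weights
    conditioned on the decoder state s (hypothetical sketch)."""
    # Additive attention score for each modality
    scores = np.array([w_v @ np.tanh(W_s @ s + W_c[k] @ c)
                       for k, c in enumerate(contexts)])
    # Softmax over modalities -> one weight per modality per word
    beta = np.exp(scores - scores.max())
    beta /= beta.sum()
    # Weighted sum fuses the modalities into a single context vector
    fused = sum(b * c for b, c in zip(beta, contexts))
    return fused, beta

# Toy inputs: one decoder state and one context vector per modality
s = rng.standard_normal(D_S)
contexts = [rng.standard_normal(D_C) for _ in range(N_MOD)]
W_s = 0.1 * rng.standard_normal((D_C, D_S))
W_c = [0.1 * rng.standard_normal((D_C, D_C)) for _ in range(N_MOD)]
w_v = 0.1 * rng.standard_normal(D_C)

fused, beta = multimodal_attention(s, contexts, W_s, W_c, w_v)
```

Because the weights are recomputed from the decoder state at every step, the model can, for example, lean on the audio stream when generating a word like "music" and on spatiotemporal features for an action word.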

Related Material


[pdf]
[bibtex]
@InProceedings{Hori_2018_CVPR_Workshops,
author = {Hori, Chiori and Hori, Takaaki and Wichern, Gordon and Wang, Jue and Lee, Teng-Yok and Cherian, Anoop and Marks, Tim K.},
title = {Multimodal Attention for Fusion of Audio and Spatiotemporal Features for Video Description},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2018}
}