Diffusion-based Multimodal Video Captioning

Jaakko Kainulainen, Zixin Guo, Jorma Laaksonen; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 2820-2837

Abstract


Diffusion-based models have recently demonstrated notable success in various generative tasks involving continuous signals, such as image, video, and audio synthesis. However, their applicability to video captioning has not yet received widespread attention, primarily due to the discrete nature of captions and the complexities of conditional generation across multiple modalities. This paper delves into diffusion-based video captioning and experiments with various modality fusion methods and different modality combinations to assess their impact on the quality of generated captions. The novelty of our proposed MM-Diff-Net lies in the use of diffusion models for multimodal video captioning and in the introduction of several mid-fusion techniques for that purpose. Additionally, we propose a new input modality, the generated description, which the model attends to in order to enhance caption quality. Experiments are conducted on four well-established benchmark datasets, YouCook2, MSR-VTT, VATEX, and VALOR-32K, to evaluate the proposed model and fusion methods. The findings indicate that combining all modalities yields the best captions, but that the effect of the fusion methods varies across datasets. The performance of our proposed model shows the potential of diffusion-based models in video captioning, paving the way for further exploration and future research in the area.
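
To make the mid-fusion idea in the abstract concrete, the sketch below shows one way a diffusion caption denoiser could cross-attend, at every layer, to concatenated video, audio, and generated-description features while denoising caption embeddings. This is a minimal PyTorch sketch under our own assumptions; the class and parameter names (MidFusionBlock, CaptionDenoiser, d_model, etc.) are hypothetical and do not reflect the paper's actual MM-Diff-Net implementation.

```python
# Hypothetical sketch of cross-attention mid-fusion for a diffusion caption
# denoiser. All names and hyperparameters are illustrative, not the paper's.
import torch
import torch.nn as nn


class MidFusionBlock(nn.Module):
    """Self-attention over noisy caption embeddings, then cross-attention
    to concatenated multimodal (video / audio / description) features."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x, context):
        # x:       (B, L, d) noisy caption token embeddings at diffusion step t
        # context: (B, M, d) concatenated modality features
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, context, context, need_weights=False)[0]
        return x + self.ffn(self.norm3(x))


class CaptionDenoiser(nn.Module):
    """Stacks mid-fusion blocks and predicts clean caption embeddings from
    their noised version, conditioned on the fused modality features."""

    def __init__(self, d_model: int = 512, depth: int = 4):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model)
        )
        self.blocks = nn.ModuleList(MidFusionBlock(d_model) for _ in range(depth))
        self.out = nn.Linear(d_model, d_model)

    def forward(self, noisy_caption, t, video_feats, audio_feats, desc_feats):
        # Mid-fusion: modality streams are concatenated along the token axis
        # and attended to inside every block, not only at the input.
        context = torch.cat([video_feats, audio_feats, desc_feats], dim=1)
        x = noisy_caption + self.time_embed(t.float().unsqueeze(-1)).unsqueeze(1)
        for block in self.blocks:
            x = block(x, context)
        return self.out(x)


if __name__ == "__main__":
    B, L, d = 2, 20, 512
    model = CaptionDenoiser(d_model=d)
    x0_hat = model(
        noisy_caption=torch.randn(B, L, d),
        t=torch.randint(0, 1000, (B,)),
        video_feats=torch.randn(B, 32, d),
        audio_feats=torch.randn(B, 16, d),
        desc_feats=torch.randn(B, 24, d),
    )
    print(x0_hat.shape)  # torch.Size([2, 20, 512])
```

Fusing inside every block (mid-fusion), rather than only concatenating modality features at the input (early fusion) or merging outputs at the end (late fusion), lets each denoising layer re-weight the modalities as the caption estimate sharpens over the reverse diffusion steps.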

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Kainulainen_2024_ACCV,
    author    = {Kainulainen, Jaakko and Guo, Zixin and Laaksonen, Jorma},
    title     = {Diffusion-based Multimodal Video Captioning},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {2820-2837}
}