Show Think and Tell: Thought-Augmented Fine-Tuning of Large Language Models for Video Captioning

Byoungjip Kim, Dasol Hwang, Sungjun Cho, Youngsoo Jang, Honglak Lee, Moontae Lee; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 1808-1817

Abstract


Large language models (LLMs) have achieved great success in natural language processing and hold significant potential for multi-modal applications. Despite their surprising zero-shot and few-shot abilities, effectively fine-tuning pre-trained language models for specific downstream tasks remains necessary. In this paper, we introduce CaptionT5, a video captioning model that fine-tunes T5 to understand videos and generate descriptive captions. To generate captions that better correspond to the video, CaptionT5 introduces thought-augmented fine-tuning for video captioning, in which a pre-trained language model is fine-tuned on thought-augmented video inputs. This resembles the process by which humans see a video, think of visual concepts such as objects and actions, and then tell a correct and natural sentence based on those thoughts. To generate thoughts automatically, we propose (1) CLIP-guided thought sampling, which samples thoughts based on similarity in an image-text multimodal embedding space by leveraging CLIP. We also propose (2) CLIP-guided caption ranking during decoding for further performance gains. Through experiments on the VATEX, MSRVTT, and YC2 datasets, we empirically demonstrate that CaptionT5 performs competitively against prior video captioning approaches without using encoders specialized for video data. Further experiments show that CaptionT5 is especially effective when only a small number of video frames is sampled.
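The core mechanism shared by both proposed components, selecting candidate texts (thoughts or captions) by their CLIP similarity to the video frames, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes frame and text embeddings have already been produced by CLIP's image and text encoders, and simply ranks candidates by mean cosine similarity across the sampled frames.

```python
import numpy as np

def topk_by_clip_similarity(frame_embs, cand_embs, k):
    """Rank candidate texts by mean cosine similarity to the video frames.

    frame_embs: (F, D) array of frame image embeddings (assumed to come
                from a CLIP image encoder).
    cand_embs:  (C, D) array of candidate text embeddings (thought concepts
                or generated captions, from a CLIP text encoder).
    Returns the indices of the top-k candidates, best first.
    """
    # L2-normalize so dot products become cosine similarities.
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = c @ f.T              # (C, F) per-frame cosine similarities
    scores = sims.mean(axis=1)  # average over the sampled frames
    return np.argsort(-scores)[:k]
```

For thought sampling, `cand_embs` would hold embeddings of a concept vocabulary (objects, actions) and the top-k concepts are prepended to the model input; for caption ranking, it would hold embeddings of the decoded caption candidates, and the best-scoring caption is kept.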

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Kim_2024_CVPR,
    author    = {Kim, Byoungjip and Hwang, Dasol and Cho, Sungjun and Jang, Youngsoo and Lee, Honglak and Lee, Moontae},
    title     = {Show Think and Tell: Thought-Augmented Fine-Tuning of Large Language Models for Video Captioning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {1808-1817}
}