Large-Scale Bidirectional Training for Zero-Shot Image Captioning

Taehoon Kim, Mark Marsden, Pyunghwan Ahn, Sangyun Kim, Sihaeng Lee, Alessandra Sala, Seung Hwan Kim; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 7373-7383

Abstract


When trained on large-scale datasets, image captioning models can understand the content of images from a general domain but often fail to generate accurate, detailed captions. To improve performance, pretraining-and-finetuning has been a key strategy for image captioning. However, we find that large-scale bidirectional training between image and text enables zero-shot image captioning. In this paper, we introduce Bidirectional Image Text Training in largER Scale (BITTERS), an efficient training and inference framework for zero-shot image captioning. We also propose a new evaluation benchmark, which comprises high-quality datasets and an extensive set of metrics, to properly evaluate zero-shot captioning accuracy and societal bias. We additionally provide an efficient finetuning approach for keyword extraction. We show that careful selection of the large-scale training set and model architecture is the key to achieving zero-shot image captioning.
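The abstract does not spell out the training objective, but a minimal sketch of what bidirectional image-text training could look like is given below, assuming a single shared autoregressive transformer trained over discretized image tokens and text tokens in both orderings (image-to-text for captioning, text-to-image for generation). This is an illustration of the general technique only; all class and function names are hypothetical and not taken from the paper.

import torch
import torch.nn as nn

class BidirectionalCaptioner(nn.Module):
    """Hypothetical shared transformer over a joint image+text token vocabulary."""
    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=4, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # Causal mask: each position may attend only to earlier positions.
        T = tokens.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=tokens.device),
                          diagonal=1)
        x = self.embed(tokens) + self.pos(torch.arange(T, device=tokens.device))
        return self.head(self.decoder(x, mask=mask))

def bidirectional_loss(model, image_tokens, text_tokens):
    """Sum next-token losses over both directions: image->text and text->image."""
    loss = 0.0
    for seq in (torch.cat([image_tokens, text_tokens], dim=1),   # captioning order
                torch.cat([text_tokens, image_tokens], dim=1)):  # generation order
        logits = model(seq[:, :-1])
        loss = loss + nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))
    return loss

# Usage with dummy token ids (a VQ tokenizer would normally produce image tokens):
model = BidirectionalCaptioner(vocab_size=1000)
img = torch.randint(0, 500, (2, 16))    # hypothetical image token ids
txt = torch.randint(500, 1000, (2, 8))  # hypothetical text token ids
bidirectional_loss(model, img, txt).backward()

The key design point the sketch tries to capture is that one set of weights is optimized for both directions, so captioning benefits from the text-to-image signal; the paper's actual architecture and losses may differ.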

Related Material


@InProceedings{Kim_2024_CVPR,
    author    = {Kim, Taehoon and Marsden, Mark and Ahn, Pyunghwan and Kim, Sangyun and Lee, Sihaeng and Sala, Alessandra and Kim, Seung Hwan},
    title     = {Large-Scale Bidirectional Training for Zero-Shot Image Captioning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {7373-7383}
}