Bright as the Sun: In-depth Analysis of Imagination-driven Image Captioning

Huyen Thi Thanh Tran, Takayuki Okatani; Proceedings of the Asian Conference on Computer Vision (ACCV), 2022, pp. 4613-4630


Existing studies on image captioning mainly focus on generating "literal" captions based on the visual entities in an image and their basic properties, such as colors and spatial relationships. However, humans describe images not only with literal descriptions but also with "imagination-driven" descriptions, which characterize visual entities in terms of other, different entities; these are often more vivid, precise, and visually comprehensible to readers and listeners. Nonetheless, no existing study has seriously considered captions of this type. This study presents the first comprehensive analysis of the generation and evaluation of imagination-driven captions. Specifically, we first analyze the imagination-driven captions found in existing image captioning datasets. We then present comprehensive categorizations of imagination-driven captions and their usage, and discuss the (potential) issues that current image captioning models face in generating such captions. Next, by compiling the captions extracted from the existing datasets and synthesizing fake captions, we create a dataset named IdC-I and -II. Using this dataset, we examine how accurately nine existing image captioning metrics can evaluate imagination-driven caption generation. Last, we propose a baseline model for imagination-driven captioning. It has a built-in mechanism for selecting whether to generate a literal or an imagination-driven caption, which existing image captioning models lack. Experimental results demonstrate that our model performs better than six existing models, especially for imagination-driven caption generation. Dataset and code will be publicly available at:

