Show, Conceive and Tell: Image Captioning with Prospective Linguistic Information

Yiqing Huang, Jiansheng Chen; Proceedings of the Asian Conference on Computer Vision (ACCV), 2020


Attention based encoder-decoder models have achieved competitive performances in image captioning. However, these models usually follow the auto-regressive way during inference, meaning that only the previously generated words, namely the explored linguistic information, can be utilized for caption generation.Intuitively, enabling the model to conceive the prospective linguistic information contained in the words to be generated can be beneficial for further improving the captioning results. Consequently, we devise a novel Prospective information guided LSTM (Pro-LSTM) model, to exploit both prospective and explored information to boost captioning. For each image, we first draft a coarse caption which roughly describes the whole image contents. At each time step, we mine the prospective and explored information from the coarse caption. These two kinds of information are further utilized by a Prospective information guided Attention (ProA) module to guide our model to comprehensively utilize the visual feature from a semantically global perspective. We also propose an Attentive Attribute Detector (AAD) which refines the object features to predict the image attributes more precisely. This further improves the semantic quality of the generated caption. Thanks to the prospective information and more accurate attributes, the Pro-LSTM model achieves state-of-the-art performances on the MSCOCO dataset with a 129.5 CIDEr-D.

Related Material

[pdf] [supp]
@InProceedings{Huang_2020_ACCV, author = {Huang, Yiqing and Chen, Jiansheng}, title = {Show, Conceive and Tell: Image Captioning with Prospective Linguistic Information}, booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)}, month = {November}, year = {2020} }