Rethinking the Form of Latent States in Image Captioning

Bo Dai, Deming Ye, Dahua Lin; Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 282-298


Recurrent Neural Networks (RNNs) and their variants, e.g. GRU and LSTM, have been widely adopted for image captioning. In an RNN, the production of a caption is driven by a sequence of latent states. Existing captioning models usually represent latent states as vectors, taking this practice for granted. In our work, we rethink this choice and study an alternative formulation, namely using two-dimensional maps to encode latent states and convolutions for state transitions. This is motivated by a basic question: how are the spatial structures in the latent states related to the resultant captions? Our study on MSCOCO [1] and Flickr30k [2] leads to two significant observations. First, the formulation with 2D states is generally more effective in captioning, consistently achieving higher performance with comparable parameter sizes. Second, 2D states preserve spatial locality. Taking advantage of this, we derive a simple scheme that can visually reveal the internal dynamics in the process of caption generation, as well as the connections between the input visual domain and the output linguistic domain.
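The core formulation can be illustrated with a minimal sketch: the hidden state is an H x W map rather than a vector, and both the state-to-state and input-to-state transforms are convolutions, so each hidden unit depends only on a local neighbourhood. This is an illustrative toy (single channel, 3 x 3 kernels, numpy only), not the paper's actual model; all function names and kernel sizes here are assumptions.

```python
import numpy as np

def conv2d_same(x, kernel):
    # Single-channel "same" convolution (cross-correlation) with
    # zero padding, written out explicitly for clarity.
    H, W = x.shape
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out

def step_2d(state, inp, k_hh, k_xh):
    # One recurrent step with a 2D latent state: both the previous
    # state and the input feature map are H x W grids, and the
    # transition is convolutional, so spatial locality is preserved.
    return np.tanh(conv2d_same(state, k_hh) + conv2d_same(inp, k_xh))
```

Because the transition is a small convolution, perturbing one input location only affects a local neighbourhood of the next state, which is the locality property the paper exploits for visualization:

```python
rng = np.random.default_rng(0)
k_hh = rng.standard_normal((3, 3))
k_xh = rng.standard_normal((3, 3))
h0 = np.zeros((5, 5))
x = np.zeros((5, 5))
x_perturbed = x.copy()
x_perturbed[2, 2] = 1.0
delta = np.abs(step_2d(h0, x_perturbed, k_hh, k_xh) - step_2d(h0, x, k_hh, k_xh))
# The change is confined to the 3 x 3 neighbourhood around (2, 2).
```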

Related Material

@InProceedings{Dai_2018_ECCV,
  author = {Dai, Bo and Ye, Deming and Lin, Dahua},
  title = {Rethinking the Form of Latent States in Image Captioning},
  booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
  month = {September},
  year = {2018}
}