Common Subspace for Model and Similarity: Phrase Learning for Caption Generation From Images

Yoshitaka Ushiku, Masataka Yamaguchi, Yusuke Mukuta, Tatsuya Harada; Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 2668-2676

Abstract


Generating captions that describe images is a fundamental problem combining computer vision and natural language processing. Recent works focus on descriptive phrases, such as "a white dog", to explain the visual components of an input image. Phrases can not only express objects, attributes, events, and their relations but also reduce visual complexity. A caption for an input image can then be generated by connecting estimated phrases using a grammar model. However, because phrases are combinations of words, there are far more phrases than single words; consequently, the accuracy of phrase estimation suffers from having too few training samples per phrase. In this paper, we propose a novel phrase-learning method: Common Subspace for Model and Similarity (CoSMoS). To overcome the shortage of training samples, CoSMoS learns a subspace in which (a) all feature vectors associated with the same phrase are mapped close to one another, (b) a classifier is learned for each phrase, and (c) training samples are shared among co-occurring phrases. Experimental results demonstrate that our system is more accurate than earlier systems and that its accuracy increases as more training data is gathered from the web.
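
The three requirements (a)-(c) can be illustrated with a small numerical sketch. The following Python/NumPy code is a hypothetical illustration of the general idea only, not the authors' actual objective or implementation: learn_cosmos, the logistic loss, the centroid pull term, and all parameter names are assumptions made for exposition. Requirement (c), sharing samples among co-occurring phrases, is assumed to be handled by expanding the label matrix Y before training so that a sample labelled with a phrase also counts as a positive for phrases that co-occur with it.

import numpy as np

rng = np.random.default_rng(0)

def learn_cosmos(X, Y, n_dim=32, lam=0.5, lr=0.05, epochs=100):
    """X: (n, d) image features; Y: (n, p) binary phrase labels.
    Returns W (d, n_dim), the map into the common subspace, and
    U (p, n_dim), one linear classifier per phrase in that subspace."""
    n, d = X.shape
    W = rng.normal(scale=0.01, size=(d, n_dim))
    U = rng.normal(scale=0.01, size=(Y.shape[1], n_dim))
    for _ in range(epochs):
        Z = X @ W                                  # project into the subspace
        G = 1.0 / (1.0 + np.exp(-(Z @ U.T))) - Y   # logistic-loss gradient, (b)
        # (a): pull each projected sample toward the centroids of its phrases,
        # so features sharing a phrase end up mutually close in the subspace
        C = (Y.T @ Z) / np.maximum(Y.sum(0), 1)[:, None]      # phrase centroids
        pull = Z - (Y @ C) / np.maximum(Y.sum(1), 1)[:, None]
        W -= lr * (X.T @ (G @ U + lam * pull)) / n
        U -= lr * (G.T @ Z) / n
    return W, U

A toy usage, with random stand-in data:

X = rng.normal(size=(200, 100))
Y = (rng.random((200, 20)) < 0.1).astype(float)  # (c): pre-expanded labels
W, U = learn_cosmos(X, Y)
scores = (X @ W) @ U.T   # phrase scores, from which phrases would be picked
                         # and connected by a grammar model to form a caption

The design point this sketch tries to capture is that the similarity constraint (a) and the classifiers (b) live in one shared low-dimensional space, so information pooled from scarce, co-occurring phrases (c) benefits both at once.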

Related Material


[bibtex]
@InProceedings{Ushiku_2015_ICCV,
author = {Ushiku, Yoshitaka and Yamaguchi, Masataka and Mukuta, Yusuke and Harada, Tatsuya},
title = {Common Subspace for Model and Similarity: Phrase Learning for Caption Generation From Images},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
month = {December},
year = {2015}
}