Generating Diverse and Descriptive Image Captions Using Visual Paraphrases

Lixin Liu, Jiajun Tang, Xiaojun Wan, Zongming Guo; The IEEE International Conference on Computer Vision (ICCV), 2019, pp. 4240-4249


Recently, there has been significant progress in image captioning with the help of deep learning. However, captions generated by current state-of-the-art models are still far from satisfactory, despite their high scores on conventional metrics such as BLEU and CIDEr. Human-written captions are diverse, informative and precise, whereas machine-generated captions tend to be simple, vague and dull. In this paper, to improve the diversity and descriptiveness of generated image captions, we propose a model that exploits visual paraphrases (different sentences describing the same image) in captioning datasets. We explore different strategies for selecting useful visual paraphrase pairs for training by designing a variety of scoring functions. Our model consists of two decoding stages: a preliminary caption is generated in the first stage and then paraphrased into a more diverse and descriptive caption in the second stage. Extensive experiments are conducted on the benchmark MS COCO dataset, and both automatic and human evaluation results verify the effectiveness of our model.
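As a rough illustration of what selecting visual paraphrase pairs with a scoring function could look like, the sketch below pairs up the reference captions of a single image and keeps the (source, target) pairs whose score exceeds a threshold. The abstract does not specify the paper's scoring functions, so `length_gain`, `select_paraphrase_pairs`, the threshold, and the sample captions are all hypothetical stand-ins rather than the authors' method.

```python
from itertools import permutations
from typing import Callable, List, Tuple


def length_gain(source: str, target: str) -> float:
    """Hypothetical score: number of extra words in the target caption.

    This is an assumed stand-in; the paper's actual scoring functions
    are not given in the abstract.
    """
    return len(target.split()) - len(source.split())


def select_paraphrase_pairs(
    captions: List[str],
    score: Callable[[str, str], float] = length_gain,
    threshold: float = 0.0,
) -> List[Tuple[str, str]]:
    """Keep ordered (source, target) pairs whose score exceeds the threshold."""
    pairs = []
    # Every ordered pair of distinct captions of the same image is a candidate.
    for source, target in permutations(captions, 2):
        if score(source, target) > threshold:
            pairs.append((source, target))
    return pairs


# Made-up captions for one image, used only to exercise the function.
captions = [
    "a dog on grass",
    "a brown dog runs across a grassy field",
    "a dog running outside",
]
pairs = select_paraphrase_pairs(captions)
for source, target in pairs:
    print(f"{source!r} -> {target!r}")
```

Under this assumed score, each selected pair runs from a shorter caption to a longer one, which loosely mirrors the two-stage idea of refining a preliminary caption into a more descriptive one.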

Related Material

author = {Liu, Lixin and Tang, Jiajun and Wan, Xiaojun and Guo, Zongming},
title = {Generating Diverse and Descriptive Image Captions Using Visual Paraphrases},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {October},
year = {2019}