VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

Chuang Gan, Yandong Li, Haoxiang Li, Chen Sun, Boqing Gong; Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1811-1820

Abstract


Rich and dense human-labeled datasets are the main enabling factor, among others, for the recent exciting work on vision-language understanding. Many seemingly distinct annotations (e.g., semantic segmentation and visual question answering (VQA)) are inherently connected in that they reveal different levels and perspectives of human understanding of the same visual scenes, and even of the same set of MS COCO images. The popularity of MS COCO could strongly correlate those annotations and tasks. Explicitly linking them up, as we envision, can significantly benefit not only individual tasks but also the overarching goal of unified vision-language understanding. We present preliminary work on linking the instance segmentations provided by MS COCO to the questions and answers (QA) in the VQA dataset. We call the collected links visual questions and segmentation answers (VQS). They transfer human supervision between the previously separate tasks, offer more effective leverage on existing problems, and also open the door for new tasks and richer models. We study two applications of the VQS data in this paper: supervised attention for VQA and a novel question-focused semantic segmentation task. For the former, we obtain state-of-the-art results on the VQA real multiple-choice task by simply augmenting multilayer perceptrons with attention features that are learned using the segmentation-QA links as explicit supervision. To put the latter in perspective, we study two plausible methods and an oracle upper bound.

Related Material


[bibtex]
@InProceedings{Gan_2017_ICCV,
author = {Gan, Chuang and Li, Yandong and Li, Haoxiang and Sun, Chen and Gong, Boqing},
title = {VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
month = {Oct},
year = {2017}
}