Where to Look: Focus Regions for Visual Question Answering

Kevin J. Shih, Saurabh Singh, Derek Hoiem; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4613-4621

Abstract


We present a method that learns to answer visual questions by selecting image regions relevant to the text-based query. The method maps textual queries and visual features from candidate regions into a shared space, where their relevance is scored with an inner product. It yields significant improvements on questions such as "what color," where a specific location must be evaluated, and "what room," where informative image regions must be identified selectively. We evaluate the model on the recently released VQA dataset, which features free-form human-annotated questions and answers.
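
The mechanism the abstract describes — projecting region and text features into a shared space and scoring relevance with an inner product — can be sketched in a few lines of NumPy. The projection matrices, dimensions, and softmax pooling below are illustrative assumptions for exposition, not the authors' exact parameterization (the paper, for instance, scores candidate answers jointly with the question).

import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over a 1-D array of scores.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend_regions(region_feats, text_emb, W_v, W_t):
    # region_feats : (R, Dv) one visual feature vector per candidate region
    # text_emb     : (Dt,)   embedding of the question (and candidate answer) text
    # W_v          : (D, Dv) assumed projection of visual features to the shared space
    # W_t          : (D, Dt) assumed projection of text features to the shared space
    v = region_feats @ W_v.T         # (R, D): regions mapped into the shared space
    t = W_t @ text_emb               # (D,):   query mapped into the same space
    relevance = v @ t                # (R,):   inner-product relevance per region
    weights = softmax(relevance)     # soft attention over regions
    pooled = weights @ region_feats  # (Dv,):  relevance-weighted visual feature
    return pooled, weights

# Toy usage: 5 candidate regions, 512-d visual features, 300-d text
# embedding, 128-d shared space. All sizes are illustrative.
R, Dv, Dt, D = 5, 512, 300, 128
region_feats = rng.standard_normal((R, Dv))
text_emb = rng.standard_normal(Dt)
W_v = rng.standard_normal((D, Dv)) * 0.01
W_t = rng.standard_normal((D, Dt)) * 0.01

pooled, weights = attend_regions(region_feats, text_emb, W_v, W_t)
print(weights)        # which regions the model would "look" at
print(pooled.shape)   # (512,)

The softmax step turns the per-region relevance scores into an attention distribution, so regions with high inner-product similarity to the query dominate the pooled visual feature that is used for answering.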

Related Material


@InProceedings{Shih_2016_CVPR,
author = {Shih, Kevin J. and Singh, Saurabh and Hoiem, Derek},
title = {Where to Look: Focus Regions for Visual Question Answering},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2016}
}