Where to Look: Focus Regions for Visual Question Answering
Kevin J. Shih, Saurabh Singh, Derek Hoiem; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4613-4621
Abstract
We present a method that learns to answer visual questions by selecting image regions relevant to the text-based query. The method maps textual queries and visual features from various regions into a shared space, where they are compared for relevance with an inner product. It yields significant improvements in answering questions such as "what color," where a specific location must be evaluated, and "what room," where it selectively identifies informative image regions. We evaluate the model on the recently released VQA dataset, which features free-form, human-annotated questions and answers.
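To make the relevance mechanism in the abstract concrete, the following is a minimal NumPy sketch of scoring image regions against a text query in a shared embedding space. It is an illustration under assumed names and dimensions (W_q, W_v, d_shared, the softmax weighting), not the authors' implementation.

import numpy as np

def region_relevance(question_vec, region_feats, W_q, W_v):
    # Project the text query and each region's visual features into a
    # shared embedding space (projections W_q, W_v are assumed learned).
    q = W_q @ question_vec            # shape: (d_shared,)
    R = region_feats @ W_v.T          # shape: (n_regions, d_shared)
    # Relevance of each region is its inner product with the query.
    scores = R @ q                    # shape: (n_regions,)
    # Normalize scores into weights (numerically stable softmax).
    w = np.exp(scores - scores.max())
    w /= w.sum()
    # Pool region features by their relevance weights.
    attended = w @ region_feats       # shape: (d_visual,)
    return w, attended

# Toy usage with random features and dimensions chosen for illustration.
rng = np.random.default_rng(0)
d_text, d_visual, d_shared, n_regions = 300, 4096, 512, 10
question_vec = rng.standard_normal(d_text)
region_feats = rng.standard_normal((n_regions, d_visual))
W_q = rng.standard_normal((d_shared, d_text)) * 0.01
W_v = rng.standard_normal((d_shared, d_visual)) * 0.01
weights, attended = region_relevance(question_vec, region_feats, W_q, W_v)

The weights indicate which regions the model attends to for a given question, and the pooled feature vector would feed a downstream answer classifier.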
Related Material
[pdf]
[video]
[bibtex]
@InProceedings{Shih_2016_CVPR,
author = {Shih, Kevin J. and Singh, Saurabh and Hoiem, Derek},
title = {Where to Look: Focus Regions for Visual Question Answering},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2016}
}