Question-Guided Hybrid Convolution for Visual Question Answering

Peng Gao, Hongsheng Li, Shuang Li, Pan Lu, Yikang Li, Steven C.H. Hoi, Xiaogang Wang; Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 469-485


In this paper, we propose a novel Question-Guided Hybrid Convolution (QGHC) network for Visual Question Answering (VQA). Most state-of-the-art VQA methods fuse high-level textual and visual features from the neural network and discard visual spatial information when learning multi-modal features. To address these problems, we design question-guided kernels, generated from the input question, that are convolved with the visual features to capture the relationship between text and image at an early stage. Question-guided convolution couples the textual and visual information tightly, but learning the kernels also introduces more parameters. We therefore apply group convolution, consisting of question-independent kernels and question-dependent kernels, to reduce the number of parameters and alleviate over-fitting. The resulting hybrid convolution generates discriminative multi-modal features with fewer parameters. Our approach is also complementary to existing bilinear pooling fusion and attention methods, and integrating it with them can further boost performance. Extensive experiments on public VQA datasets validate the effectiveness of QGHC.
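The hybrid convolution described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes 1x1 kernels, a bias-free linear layer as the kernel predictor, and plain Python lists instead of tensors, and every name in it (`hybrid_group_conv`, `predictor`, `static_kernels`, `dyn_groups`, `param_counts`) is hypothetical. The input channels are split into groups; the first `dyn_groups` groups use kernels predicted from the question vector, and the remaining groups use ordinary kernels that do not depend on the question.

```python
def linear(weights, vec):
    """Bias-free linear layer: weights is a list of rows, vec a list of floats."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def conv1x1(kernel, feats):
    """1x1 convolution. kernel: [c_out][c_in]; feats: [c_in][h][w]."""
    h, w = len(feats[0]), len(feats[0][0])
    return [
        [[sum(k * feats[c][i][j] for c, k in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in kernel
    ]


def hybrid_group_conv(feats, question, predictor, static_kernels, groups, dyn_groups):
    """Group convolution mixing question-dependent and question-independent kernels.

    feats:          visual feature map, [c_in][h][w]
    question:       question embedding, [q_dim]
    predictor:      one weight matrix per dynamic group, each [g_out * g_in][q_dim]
    static_kernels: one kernel per static group, each [g_out][g_in]
    """
    g_in = len(feats) // groups  # input channels handled by each group
    out = []
    for g in range(groups):
        chunk = feats[g * g_in:(g + 1) * g_in]
        if g < dyn_groups:
            # Question-dependent group: predict the kernel from the question.
            flat = linear(predictor[g], question)
            g_out = len(flat) // g_in
            kernel = [flat[r * g_in:(r + 1) * g_in] for r in range(g_out)]
        else:
            # Question-independent group: a regular learned kernel.
            kernel = static_kernels[g - dyn_groups]
        out.extend(conv1x1(kernel, chunk))  # spatial layout is preserved
    return out


def param_counts(q_dim, c_in, c_out, groups, dyn_groups):
    """Weights needed to predict one dense 1x1 kernel vs. the hybrid grouped scheme."""
    g_in, g_out = c_in // groups, c_out // groups
    full = q_dim * c_in * c_out
    hybrid = dyn_groups * q_dim * g_in * g_out + (groups - dyn_groups) * g_in * g_out
    return full, hybrid
```

In this sketch, changing the question vector changes only the output channels of the question-dependent groups, and the output keeps the spatial size of the input feature map. `param_counts` shows, under the same assumptions, why grouping shrinks the kernel predictor: each dynamic group predicts a kernel over only `c_in / groups` input channels, and the static groups add no question-conditioned weights at all.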

Related Material

@InProceedings{Gao_2018_ECCV,
author = {Gao, Peng and Li, Hongsheng and Li, Shuang and Lu, Pan and Li, Yikang and Hoi, Steven C.H. and Wang, Xiaogang},
title = {Question-Guided Hybrid Convolution for Visual Question Answering},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV)},
month = {September},
year = {2018}
}