Adding object detection skills to visual dialogue agents

Bani, Gabriele; Belli, Davide; Dagan, Gautier; Geenen, Alexander; Skliar, Andrii; Venkatesh, Aashish; Baumgartner, Tim; Bruni, Elia; Fernandez, Raquel

Gabriele Bani, Davide Belli, Gautier Dagan, Alexander Geenen, Andrii Skliar, Aashish Venkatesh, Tim Baumgartner, Elia Bruni, Raquel Fernandez; Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018

Abstract

Our goal is to equip a dialogue agent that asks questions about a visual scene with object detection skills. We take the first steps in this direction within the GuessWhat?! game. We use Mask R-CNN object features as a replacement for ground-truth annotations in the Guesser module, achieving an accuracy of 57.92%. This shows that our system is a viable alternative to the original Guesser, which achieves an accuracy of 62.77% using ground-truth annotations; that figure should therefore be considered an upper bound for our automated system. Crucially, we show that our system exploits the Mask R-CNN object features, in contrast to the original Guesser augmented with global VGG features. Furthermore, by automating object detection in GuessWhat?!, we open up a spectrum of opportunities, such as playing the game with new, non-annotated images and using the more granular visual features to condition the other modules of the game architecture.

Related Material

[bibtex]

@InProceedings{Bani_2018_ECCV_Workshops,
author = {Bani, Gabriele and Belli, Davide and Dagan, Gautier and Geenen, Alexander and Skliar, Andrii and Venkatesh, Aashish and Baumgartner, Tim and Bruni, Elia and Fernandez, Raquel},
title = {Adding object detection skills to visual dialogue agents},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV) Workshops},
month = {September},
year = {2018}
}