InViG: Benchmarking Open-Ended Interactive Visual Grounding with 500K Dialogues
Abstract
Ambiguity is ubiquitous in human communication. Previous approaches in Human-Robot Interaction (HRI) have often relied on predefined interaction templates, leading to reduced performance in realistic and open-ended scenarios. To address these issues, we present InViG, a large-scale dataset for interactive visual grounding under language ambiguity. Our dataset comprises over 520K images accompanied by open-ended, goal-oriented disambiguation dialogues, encompassing millions of object instances and corresponding question-answer pairs. Leveraging the InViG dataset, we conduct extensive studies and propose a set of baseline solutions for end-to-end interactive visual disambiguation and grounding, achieving a 45.6% success rate during validation. To the best of our knowledge, the InViG dataset is the first large-scale dataset for resolving open-ended interactive visual grounding, presenting a practical yet highly challenging benchmark for ambiguity-aware HRI. Code and datasets are available at: https://openivg.github.io.
Related Material

[bibtex]
@InProceedings{Zhang_2024_CVPR,
  author    = {Zhang, Hanbo and Xu, Jie and Mo, Yuchen and Kong, Tao},
  title     = {InViG: Benchmarking Open-Ended Interactive Visual Grounding with 500K Dialogues},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {5508-5518}
}