InViG: Benchmarking Open-Ended Interactive Visual Grounding with 500K Dialogues
Abstract
Ambiguity is ubiquitous in human communication. Previous approaches in Human-Robot Interaction (HRI) have often relied on predefined interaction templates, leading to reduced performance in realistic and open-ended scenarios. To address these issues, we present InViG, a large-scale dataset for interactive visual grounding under language ambiguity. Our dataset comprises over 520K images accompanied by open-ended, goal-oriented disambiguation dialogues, encompassing millions of object instances and corresponding question-answer pairs. Leveraging the InViG dataset, we conduct extensive studies and propose a set of baseline solutions for end-to-end interactive visual disambiguation and grounding, achieving a 45.6% success rate during validation. To the best of our knowledge, the InViG dataset is the first large-scale dataset for resolving open-ended interactive visual grounding, presenting a practical yet highly challenging benchmark for ambiguity-aware HRI. Code and datasets are available at: https://openivg.github.io.
Related Material

[bibtex]
@InProceedings{Zhang_2024_CVPR,
  author    = {Zhang, Hanbo and Xu, Jie and Mo, Yuchen and Kong, Tao},
  title     = {InViG: Benchmarking Open-Ended Interactive Visual Grounding with 500K Dialogues},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {5508-5518}
}