When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach

Ma, Tao; Bai, Bing; Lin, Haozhe; Wang, Heyuan; Wang, Yu; Luo, Lin; Fang, Lu

Tao Ma, Bing Bai, Haozhe Lin, Heyuan Wang, Yu Wang, Lin Luo, Lu Fang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22119-22128

Abstract

Visual grounding refers to the process of associating natural language expressions with corresponding regions within an image. Existing benchmarks for visual grounding primarily operate within small-scale scenes with a few objects. Nevertheless, recent advances in imaging technology have enabled the acquisition of gigapixel-level images, providing high-resolution details in large-scale scenes containing numerous objects. To bridge this gap between imaging and computer vision benchmarks and make grounding more practically valuable, we introduce a novel dataset, named GigaGrounding, designed to challenge visual grounding models in gigapixel-level large-scale scenes. We extensively analyze and compare the dataset with existing benchmarks, demonstrating that GigaGrounding presents unique challenges such as large-scale scene understanding, gigapixel-level resolution, significant variations in object scales, and "multi-hop expressions". Furthermore, we introduce a simple yet effective grounding approach, which employs a "glance-to-zoom-in" paradigm and exhibits enhanced capabilities for addressing the GigaGrounding task. The dataset is available at www.gigavision.ai.

Related Material

@InProceedings{Ma_2024_CVPR,
    author    = {Ma, Tao and Bai, Bing and Lin, Haozhe and Wang, Heyuan and Wang, Yu and Luo, Lin and Fang, Lu},
    title     = {When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {22119-22128}
}