M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA

Venna, Venkata Kesav; Gunda, Sai Madhusudan; Jinka, Jyothi Swaroopa; Rachakonda, Hrithik Sagar; Srinivasan, Anirudh; Sarvadevabhatla, Ravi Kiran

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 23685-23695

Abstract

**Document QA** requires not only accurate answers but also identifying where each answer is grounded on the page. Most models treat the task as text-only generation, while existing answer-grounding methods generate coarse bounding boxes that fail to capture curved text. We introduce **M3Grounder, a hybrid vision-language and segmentation architecture that formulates document grounding as pixel-level segmentation. It produces fine-grained evidence masks** refined by a bleed-suppression loss to prevent spillover. M3Grounder autoregressively generates answer text interleaved with [GROUND] tokens that link individual answer spans to their corresponding evidence regions. In addition, **M3Grounder grounds evidence hierarchically across phrase, line, and block levels** using an enclosure loss that enforces spatial containment. We release the **GroundingDocQA dataset (200K documents, 2M multi-span and multi-granular QA pairs with pixel-level grounding masks)**, built through a data engine that handles complex layouts, curved text, and graphics-rich documents. We also release **GroundingDocQA-Bench, a diverse and challenging human-verified benchmark**. M3Grounder sets **a new state of the art in grounded DocVQA, advancing from coarse boxes to hierarchical, fine-grained, and contextually grounded evidence.**
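
The abstract does not give the form of the enclosure loss, so the following is only an illustrative sketch of what "spatial containment" across phrase, line, and block masks could mean, not the authors' formulation. It assumes soft masks with values in [0, 1] and penalizes the share of a finer mask that falls outside the next coarser one. The names `enclosure_penalty` and `hierarchical_enclosure` are hypothetical.

```python
import numpy as np


def enclosure_penalty(inner, outer, eps=1e-8):
    """Share of the inner mask's mass that lies outside the outer mask.

    Assumed form (not taken from the paper): sum(inner * (1 - outer)) / sum(inner).
    Returns 0.0 when the inner mask is fully contained in the outer mask.
    """
    inner = np.asarray(inner, dtype=np.float64)
    outer = np.asarray(outer, dtype=np.float64)
    spill = inner * (1.0 - outer)  # inner-mask mass not covered by the outer mask
    return float(spill.sum() / (inner.sum() + eps))


def hierarchical_enclosure(phrase, line, block):
    """Sum of containment penalties for phrase-in-line and line-in-block."""
    return enclosure_penalty(phrase, line) + enclosure_penalty(line, block)


# Toy 4x4 page: the block covers rows 0-2, the line is row 1,
# and the phrase is two pixels inside that line.
block = np.zeros((4, 4))
block[0:3, :] = 1.0
line = np.zeros((4, 4))
line[1, :] = 1.0
phrase = np.zeros((4, 4))
phrase[1, 1:3] = 1.0

nested = hierarchical_enclosure(phrase, line, block)

# A phrase mask that also marks two pixels in row 3, outside its line.
leaky_phrase = phrase.copy()
leaky_phrase[3, 1:3] = 1.0
leaky = hierarchical_enclosure(leaky_phrase, line, block)

print(nested, leaky)
```

In this toy case the properly nested masks incur no penalty, while the leaky phrase mask is penalized for the half of its area that lies outside the line mask.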

Related Material

[bibtex]

@InProceedings{Venna_2026_CVPR,
    author    = {Venna, Venkata Kesav and Gunda, Sai Madhusudan and Jinka, Jyothi Swaroopa and Rachakonda, Hrithik Sagar and Srinivasan, Anirudh and Sarvadevabhatla, Ravi Kiran},
    title     = {M3Grounder: Mask-Based Multi-Span and Multi-Granular Grounding for Document QA},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {23685-23695}
}