M3DocVQA: Multi-modal Multi-page Multi-document Understanding
Abstract
Document Visual Question Answering (DocVQA) offers a promising approach to extracting insights from large document corpora. However, existing benchmarks focus on evaluating multi-modal understanding within a single document, which hinders the development of methods that integrate information scattered across pages and documents. To address this gap, we introduce M3DocVQA, the first benchmark designed for multi-modal, multi-page, and multi-document understanding. M3DocVQA comprises over 3,000 PDF documents with more than 40,000 pages, offering a challenging setting in which evidence is distributed across diverse sources and modalities. Alongside the dataset, we introduce M3DocRAG, a baseline method based on multi-modal retrieval-augmented generation. M3DocRAG flexibly handles both single- and multi-document settings while preserving critical visual information, establishing a useful starting point for future work on open-domain multi-modal document understanding. Our experiments across three benchmarks (M3DocVQA, MMLongBench-Doc, and MP-DocVQA) show that existing methods struggle with open-domain question answering over extensive, multi-modal documents. While M3DocRAG shows promising performance, substantial room for improvement remains. We provide comprehensive ablation studies of different indexing strategies, multi-modal language models, and multi-modal retrieval models, along with qualitative examples to guide future research.
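To make the retrieval-augmented setup concrete, the following is a minimal sketch of a multi-modal RAG pipeline in the spirit of the approach described above; it is not the authors' implementation. Every PDF page is embedded once by a multi-modal page encoder, the question is embedded into the same vector space, the top-k pages across all documents are retrieved by cosine similarity, and a multi-modal language model answers from the retrieved page images. The functions embed_page, embed_query, and answer_with_mlm are illustrative stubs standing in for real retriever and generator models.

import numpy as np

# Illustrative stubs: in a real system these would wrap a multi-modal
# page retriever and a multi-modal language model (names are hypothetical).
def embed_page(page_image):
    """Return a fixed-size embedding for one rendered PDF page (stub)."""
    rng = np.random.default_rng(abs(hash(page_image)) % (2 ** 32))
    return rng.standard_normal(128)

def embed_query(question):
    """Embed the question into the same vector space as the pages (stub)."""
    rng = np.random.default_rng(abs(hash(question)) % (2 ** 32))
    return rng.standard_normal(128)

def answer_with_mlm(question, retrieved_pages):
    """Answer from the question plus the retrieved page images (stub)."""
    return f"answer grounded in {len(retrieved_pages)} retrieved pages"

def build_index(corpus):
    """Embed every page of every document once; keep (doc_id, page_id) keys."""
    keys, vectors = [], []
    for doc_id, pages in corpus.items():
        for page_id, page_image in enumerate(pages):
            keys.append((doc_id, page_id))
            vectors.append(embed_page(page_image))
    return keys, np.stack(vectors)

def retrieve(question, keys, vectors, k=4):
    """Return the top-k (doc_id, page_id) pairs by cosine similarity."""
    q = embed_query(question)
    sims = vectors @ q / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q) + 1e-8)
    return [keys[i] for i in np.argsort(-sims)[:k]]

if __name__ == "__main__":
    # Toy corpus: document id -> list of rendered page images (strings here).
    corpus = {"doc_a": ["page_a0", "page_a1"], "doc_b": ["page_b0"]}
    keys, vectors = build_index(corpus)
    hits = retrieve("When was the company founded?", keys, vectors, k=2)
    print(hits)
    print(answer_with_mlm("When was the company founded?", hits))

The design choice this sketch reflects is indexing at the page level across the entire corpus, so that a question whose evidence spans multiple pages or documents can surface all relevant pages in a single retrieval step before generation.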
Related Material
[pdf] [supp] [bibtex]
@InProceedings{Cho_2025_ICCV,
  author    = {Cho, Jaemin and Mahata, Debanjan and Irsoy, Ozan and He, Yujie and Bansal, Mohit},
  title     = {M3DocVQA: Multi-modal Multi-page Multi-document Understanding},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {6178-6188}
}