MIND-RAG: Multimodal Context-Aware and Intent-Aware Retrieval-Augmented Generation for Educational Publications

Abstract
Although multimodal Retrieval-Augmented Generation (RAG) systems have demonstrated wide applicability, they still suffer from limited image interpretability and weak retrieval performance when processing domain-specific documents. To address these challenges, we propose Multimodal INtent-Driven Retrieval-Augmented Generation (MIND-RAG), a novel framework tailored to educational scientific journals. MIND-RAG introduces two core innovations: (1) Context-aware image summarization, which extracts the relevant textual context surrounding each image and uses it as a prompt for a large multimodal model to generate a semantic summary, enabling subsequent text-only retrieval; and (2) Multimodal Intent-Aware Reranking, which jointly infers a user's intent from their latent modality needs (e.g., image, table, or text) and educational domain categories, and refines the ranking of retrieved results by aligning each document's thematic and modality-specific relevance with the inferred intent. Evaluated on the MEED-QA benchmark, which comprises educational journal entries spanning 10 years, MIND-RAG achieves 84.0% accuracy on complex Question Answering (QA) tasks and a 93.4% Mean Reciprocal Rank (MRR) for multimodal retrieval. These results demonstrate the effectiveness of MIND-RAG in real-world publication-based retrieval scenarios.
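For concreteness, the two components can be pictured as a small pipeline around an existing text retriever. The Python sketch below is illustrative only: the function names, the keyword-based intent heuristic, the placeholder domain category, and the reranking weights are hypothetical stand-ins, not the authors' implementation.

```python
from dataclasses import dataclass

@dataclass
class Document:
    text: str        # image summary, table text, or passage text
    modality: str    # "image", "table", or "text"
    topic: str       # educational domain category
    score: float     # base relevance score from the text retriever

def build_summary_prompt(surrounding_text: str) -> str:
    """Context-aware image summarization: the text surrounding the figure
    is injected into the prompt sent to a multimodal model, so the summary
    reflects the document context rather than pixel content alone."""
    return (
        "Surrounding context:\n"
        f"{surrounding_text}\n\n"
        "Using this context, write a short semantic summary of the figure."
    )

def classify_intent(query: str) -> tuple[str, str]:
    """Toy stand-in for the paper's intent model: guesses the desired
    modality from keywords and returns a placeholder domain category."""
    q = query.lower()
    if any(w in q for w in ("figure", "diagram", "picture", "image")):
        modality = "image"
    elif "table" in q:
        modality = "table"
    else:
        modality = "text"
    return modality, "general"

def intent_aware_rerank(query: str, candidates: list[Document],
                        alpha: float = 0.5, beta: float = 0.3) -> list[Document]:
    """Boost candidates whose modality and topic match the inferred intent.
    The additive weights alpha and beta are illustrative, not from the paper."""
    modality, topic = classify_intent(query)
    def adjusted(d: Document) -> float:
        return d.score + alpha * (d.modality == modality) + beta * (d.topic == topic)
    return sorted(candidates, key=adjusted, reverse=True)

if __name__ == "__main__":
    docs = [
        Document("passage on fractions pedagogy", "text", "general", 0.82),
        Document("summary of a bar chart of test scores", "image", "general", 0.78),
    ]
    top = intent_aware_rerank("show the figure with test scores", docs)
    print([d.modality for d in top])  # the image-typed result is promoted first
```

In this reading, image summaries produced with context-aware prompts enter the same text index as ordinary passages, and the intent-aware rerank step then adjusts the retriever's scores toward the modality and topic the query implies.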
Related Material

[pdf]

[bibtex]
@InProceedings{Yu_2025_ICCV,
    author    = {Yu, Jiayang and Xie, Yuxi and Zhang, Guixuan and Liu, Jie and Zeng, Zhi and Huang, Ying and Zhang, Shuwu},
    title     = {MIND-RAG: Multimodal Context-Aware and Intent-Aware Retrieval-Augmented Generation for Educational Publications},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {4216-4223}
}