ZINA: Multimodal Fine-grained Hallucination Detection and Editing

Yuiga Wada, Kazuki Matsuda, Komei Sugiura, Graham Neubig; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 32528-32538

Abstract


Multimodal Large Language Models (MLLMs) often generate hallucinations, where the output deviates from the visual content. Given that these hallucinations can take diverse forms, detecting hallucinations at a fine-grained level is essential for comprehensive evaluation and analysis. To this end, we propose a novel task of multimodal fine-grained hallucination detection and editing for MLLMs. Moreover, we propose ZINA, a novel method that identifies hallucinated spans at a fine-grained level, classifies their error types into six categories, and suggests appropriate refinements. To train and evaluate models for this task, we constructed VisionHall, a dataset comprising 6.9k outputs from twelve MLLMs manually annotated by 211 annotators, and 20k synthetic samples generated using a graph-based method that captures dependencies among error types. We demonstrated that ZINA outperformed existing methods, including GPT-4o and Llama-3.2, in both detection and editing tasks.

Related Material


[bibtex]
@InProceedings{Wada_2026_CVPR,
    author    = {Wada, Yuiga and Matsuda, Kazuki and Sugiura, Komei and Neubig, Graham},
    title     = {ZINA: Multimodal Fine-grained Hallucination Detection and Editing},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {32528-32538}
}