Make VLM Recognize Visual Hallucination on Cartoon Character Image with Pose Information

Kim, Bumsoo; Shin, Wonseop; Lee, Kyuchul; Jung, Yonghoon; Seo, Sanghyun

Bumsoo Kim, Wonseop Shin, Kyuchul Lee, Yonghoon Jung, Sanghyun Seo; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 5398-5407

Abstract

Leveraging large-scale Text-to-Image (TTI) models has become a common technique for generating training or reference data in fields such as image synthesis, video editing, and 3D reconstruction. However, semantic structural visual hallucinations, which contain perceptually critical defects, remain a concern, especially in non-photorealistic rendering domains such as cartoon and pixelization-style characters. We propose a novel system for detecting semantic structural hallucinations in cartoon-style images generated by TTI models, and we collect a new cartoon-hallucination dataset. Our approach leverages pose-aware in-context visual learning (PA-ICVL) with public Vision-Language Models (VLMs), utilizing both RGB images and pose information. By incorporating pose guidance from a fine-tuned pose estimator, we enable VLMs to make more accurate decisions. Experimental results demonstrate significant improvements in identifying visual hallucinations compared to baseline methods relying solely on RGB images. For the two selected VLMs, GPT-4v and Gemini pro vision, our proposed PA-ICVL improves hallucination detection from 50% to 78% and from 57% to 80%, respectively. This research advances the capability of TTI models toward real-world applications by mitigating visual hallucinations via in-context visual learning, expanding their potential in non-photorealistic domains. In addition, the results offer thought-provoking insights into how users can boost the domain-adaptive capability of VLMs by supplying additional conditions when VLMs confront ambiguous tasks. The dataset and demo VLMs are provided in the corresponding Git repository: https://github.com/gh-BumsooKim/Cartoon-Hallucinations-Detection.
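
As a rough illustration of the idea of pairing RGB images with pose information in an in-context prompt, the following Python sketch assembles an interleaved list of text and image parts. It is a minimal sketch, not the authors' implementation: the `Sample` class, the `build_prompt` function, the part dictionaries, the label strings, and the instruction wording are all hypothetical, and no real VLM API is called. Setting `use_pose=False` yields an RGB-only prompt of the kind the baseline would use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    """Hypothetical container for one image and its pose visualization."""
    rgb_path: str                 # generated cartoon-style image
    pose_path: str                # pose rendering from a pose estimator
    label: Optional[str] = None   # "hallucinated" or "normal"; None for the query


def build_prompt(examples, query, use_pose=True):
    """Interleave labeled in-context examples and the query as prompt parts.

    The part format ({"type": ..., ...}) is an assumption for illustration
    and does not correspond to any specific VLM client library.
    """
    parts = [{
        "type": "text",
        "text": "Decide whether the character image contains a structural "
                "hallucination. Answer 'hallucinated' or 'normal'.",
    }]
    for i, ex in enumerate(examples, start=1):
        parts.append({"type": "text", "text": f"Example {i}:"})
        parts.append({"type": "image", "path": ex.rgb_path})
        if use_pose:
            # Pose-aware variant: attach the pose rendering next to the RGB image.
            parts.append({"type": "image", "path": ex.pose_path})
        parts.append({"type": "text", "text": f"Answer: {ex.label}"})
    parts.append({"type": "text", "text": "Query:"})
    parts.append({"type": "image", "path": query.rgb_path})
    if use_pose:
        parts.append({"type": "image", "path": query.pose_path})
    parts.append({"type": "text", "text": "Answer:"})
    return parts


# Hypothetical file names, used only to show the call shape.
examples = [
    Sample("ex1_rgb.png", "ex1_pose.png", "hallucinated"),
    Sample("ex2_rgb.png", "ex2_pose.png", "normal"),
]
query = Sample("query_rgb.png", "query_pose.png")
pose_aware_parts = build_prompt(examples, query, use_pose=True)
rgb_only_parts = build_prompt(examples, query, use_pose=False)
```

The resulting list would then be converted into whatever message format the chosen VLM expects; that conversion is left out here because it depends on the specific client.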

Related Material

@InProceedings{Kim_2025_WACV,
    author    = {Kim, Bumsoo and Shin, Wonseop and Lee, Kyuchul and Jung, Yonghoon and Seo, Sanghyun},
    title     = {Make VLM Recognize Visual Hallucination on Cartoon Character Image with Pose Information},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {5398-5407}
}