Semantic Approach to Quantifying the Consistency of Diffusion Model Image Generation

Brinnae Bent; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 8218-8222

Abstract


In this study, we identify the need for an interpretable quantitative score of the repeatability, or consistency, of image generation in diffusion models. We propose a semantic approach that uses a pairwise mean CLIP (Contrastive Language-Image Pretraining) score as our Semantic Consistency Score. We applied this metric to compare two state-of-the-art open-source image generation diffusion models, Stable Diffusion XL (SDXL) and PixArt-α, and found statistically significant differences between the models' semantic consistency scores. Agreement between the model selected by the Semantic Consistency Score and aggregated human annotations was 94%. We also explored the consistency of SDXL and a LoRA fine-tuned version of SDXL and found that the fine-tuned model had significantly higher semantic consistency in its generated images. The Semantic Consistency Score proposed here offers a measure of image generation alignment, facilitating the evaluation of model architectures for specific tasks and aiding informed decision-making about model selection.
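To make the pairwise-mean idea concrete, here is a minimal, dependency-free sketch. It assumes the CLIP image embeddings for all images generated from one prompt have already been computed (for example with a CLIP image encoder such as Hugging Face `CLIPModel.get_image_features`; that step is omitted here). It also assumes each pairwise score is `max(scale * cosine, 0)` with `scale = 100`, loosely following the common CLIPScore convention. The paper's exact scaling may differ. The helper `selection_agreement` is hypothetical and illustrates one way a per-prompt model choice could be compared against aggregated human choices. It is not taken from the paper.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def semantic_consistency_score(
    embeddings: Sequence[Sequence[float]], scale: float = 100.0
) -> float:
    """Mean CLIP-style score over every unordered pair of images for one prompt.

    ASSUMPTION: each pairwise score is max(scale * cosine, 0), loosely
    following the CLIPScore convention; the paper's exact scaling may differ.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two images to form a pair")
    pair_scores = [
        max(scale * cosine_similarity(a, b), 0.0)
        for a, b in combinations(embeddings, 2)
    ]
    return sum(pair_scores) / len(pair_scores)


def selection_agreement(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    human_choices: Sequence[str],
) -> float:
    """HYPOTHETICAL helper: fraction of prompts where the model with the higher
    consistency score ("A" or "B") matches the aggregated human choice.
    Ties go to "A", an arbitrary assumption for this sketch."""
    if not human_choices or not (len(scores_a) == len(scores_b) == len(human_choices)):
        raise ValueError("inputs must be non-empty and the same length")
    picks = ["A" if sa >= sb else "B" for sa, sb in zip(scores_a, scores_b)]
    matches = sum(p == h for p, h in zip(picks, human_choices))
    return matches / len(human_choices)


if __name__ == "__main__":
    # Toy 2-D vectors standing in for real CLIP image embeddings.
    toy = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(semantic_consistency_score(toy))  # mean of pair scores 100, 0, 0
    print(selection_agreement([90, 80, 70], [85, 88, 60], ["A", "B", "B"]))
```

Averaging over all unordered pairs (rather than comparing each image to a single reference) keeps the score independent of image order. Flooring at zero keeps anti-correlated pairs from pulling the mean below the range a CLIP-style score normally reports.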

Related Material


@InProceedings{Bent_2024_CVPR,
    author    = {Bent, Brinnae},
    title     = {Semantic Approach to Quantifying the Consistency of Diffusion Model Image Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {8218-8222}
}