Learning Multi-Dimensional Human Preference for Text-to-Image Generation

Sixian Zhang, Bohan Wang, Junqiang Wu, Yan Li, Tingting Gao, Di Zhang, Zhongyuan Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 8018-8027

Abstract

Current evaluation of text-to-image models typically relies on statistical metrics, which inadequately represent the real preferences of humans. Although recent work attempts to learn these preferences via human-annotated images, it reduces the rich tapestry of human preference to a single overall score. However, preference results vary when humans evaluate images along different aspects. Therefore, to learn multi-dimensional human preferences, we propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models. The MPS introduces a preference condition module on top of the CLIP model to learn these diverse preferences. It is trained on our Multi-dimensional Human Preference (MHP) Dataset, which comprises 918,315 human preference choices across four dimensions (i.e., aesthetics, semantic alignment, detail quality, and overall assessment) on 607,541 images. The images are generated by a wide range of the latest text-to-image models. The MPS outperforms existing scoring methods across 3 datasets in 4 dimensions, making it a promising metric for evaluating and improving text-to-image generation. The model and dataset will be made publicly available to facilitate future research.
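The abstract only names the components: a preference condition module on top of CLIP, trained on pairwise human choices collected per dimension. The sketch below is a minimal, hypothetical illustration of how such a conditioned scorer could be wired up, assuming the open_clip package; PreferenceConditionedScorer, condition_head, and pairwise_preference_loss are illustrative names, not the authors' released code or exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import open_clip


class PreferenceConditionedScorer(nn.Module):
    """Hypothetical MPS-style scorer: CLIP image/text features modulated by a
    preference-condition text (e.g. "aesthetics" or "detail quality")."""

    def __init__(self, clip_name: str = "ViT-B-32", pretrained: str = "openai"):
        super().__init__()
        self.clip, _, self.preprocess = open_clip.create_model_and_transforms(
            clip_name, pretrained=pretrained)
        self.tokenizer = open_clip.get_tokenizer(clip_name)
        dim = self.clip.visual.output_dim  # shared embedding width (512 here)
        # Illustrative condition module (an assumption, not the paper's design):
        # prompt + condition embeddings produce a gate over the image embedding.
        self.condition_head = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, images: torch.Tensor, prompts: list[str],
                condition: str) -> torch.Tensor:
        img = self.clip.encode_image(images)                        # (B, D)
        txt = self.clip.encode_text(
            self.tokenizer(prompts).to(images.device))              # (B, D)
        cond = self.clip.encode_text(
            self.tokenizer([condition] * len(prompts)).to(images.device))
        gate = torch.sigmoid(self.condition_head(torch.cat([txt, cond], -1)))
        img = F.normalize(img * gate, dim=-1)   # condition the image features
        txt = F.normalize(txt, dim=-1)
        return (img * txt).sum(-1)              # per-sample preference score


def pairwise_preference_loss(score_a: torch.Tensor, score_b: torch.Tensor,
                             prefers_a: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry-style objective over pairwise human choices: the image
    the annotator preferred should receive the higher score."""
    return F.binary_cross_entropy_with_logits(score_a - score_b,
                                              prefers_a.float())
```

Under this reading, one conditioned score is produced per dimension (aesthetics, semantic alignment, detail quality, overall assessment), so the same backbone can rank a pair of generated images differently depending on which preference condition is supplied.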

Related Material

[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Zhang_2024_CVPR,
    author    = {Zhang, Sixian and Wang, Bohan and Wu, Junqiang and Li, Yan and Gao, Tingting and Zhang, Di and Wang, Zhongyuan},
    title     = {Learning Multi-Dimensional Human Preference for Text-to-Image Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {8018-8027}
}