Leveraging Multimodal Large Language Models for Joint Discrete and Continuous Evaluation in Text-to-Image Alignment
Abstract
Text-to-image (T2I) generation has seen rapid advancements with the development of powerful diffusion-based and transformer-based models. These models enable the creation of both artistic illustrations and highly photorealistic images, making it increasingly important to accurately evaluate how well the generated images align with their corresponding text prompts. In this paper, we propose a novel method for evaluating image-text alignment that leverages advanced multimodal large language models (MLLMs). First, we develop a specialized prompt engineering strategy that targets fine-grained elements, such as actions, spatial relationships, quantities, and orientations, guiding the model to capture subtle details in both the textual and visual modalities. Second, we perform supervised fine-tuning using a dual-loss strategy to minimize discrepancies between predicted continuous scores and ground truth, thereby providing a more precise measure of alignment. Lastly, we propose a regression retraining approach that extracts intermediate features from the MLLM's decoder and employs a multilayer perceptron to predict alignment scores. The experimental results demonstrate that the proposed methods significantly improve both overall and fine-grained alignment evaluations, offering a robust solution for T2I alignment assessment.
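Below is a minimal sketch, in PyTorch, of how the dual-loss objective and the MLP regression head described in the abstract could be set up. The class and function names, layer widths, mean-pooling of decoder features, and the equal loss weighting are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlignmentHead(nn.Module):
    """MLP head mapping pooled MLLM decoder features to alignment predictions."""

    def __init__(self, hidden_dim: int = 4096, num_levels: int = 5):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(hidden_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, 256),
            nn.GELU(),
        )
        self.score_head = nn.Linear(256, 1)           # continuous alignment score
        self.level_head = nn.Linear(256, num_levels)  # discrete alignment level

    def forward(self, decoder_features: torch.Tensor):
        # decoder_features: (batch, seq_len, hidden_dim) taken from an
        # intermediate decoder layer of the MLLM; mean-pool over tokens.
        pooled = decoder_features.mean(dim=1)
        h = self.backbone(pooled)
        return self.score_head(h).squeeze(-1), self.level_head(h)


def dual_loss(pred_score, pred_logits, gt_score, gt_level, alpha: float = 0.5):
    """Weighted sum of a continuous (L1) and a discrete (cross-entropy) loss."""
    reg = F.l1_loss(pred_score, gt_score)          # continuous score regression
    cls = F.cross_entropy(pred_logits, gt_level)   # discrete level classification
    return alpha * reg + (1.0 - alpha) * cls


# Example with random features standing in for real MLLM decoder activations.
feats = torch.randn(8, 128, 4096)
head = AlignmentHead()
score, logits = head(feats)
loss = dual_loss(score, logits, torch.rand(8), torch.randint(0, 5, (8,)))
```

In this reading, the discrete head predicts a coarse alignment level while the continuous head regresses a fine-grained score, so a single forward pass over the extracted decoder features serves both evaluation modes.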
Related Material

[bibtex]
@InProceedings{Zhang_2025_CVPR,
  author    = {Zhang, Zhichao and Li, Xinyue and Sun, Wei and Zhang, Zicheng and Li, Yunhao and Liu, Xiaohong and Zhai, Guangtao},
  title     = {Leveraging Multimodal Large Language Models for Joint Discrete and Continuous Evaluation in Text-to-Image Alignment},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {977-986}
}