TlTScore: Towards Long-Tail Effects in Text-to-Visual Evaluation with Generative Foundation Models

Ji, Pengliang; Liu, Junchen

Pengliang Ji, Junchen Liu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 5302-5313

Abstract

The evaluation of generative foundation models (GenFMs) on text-to-visual tasks has been enhanced by automatic alignment metrics such as CLIPScore, which complement human feedback. However, existing evaluation methods suffer from a severe long-tail effect, in which the balance between token count and semantic validity in the initial step hinders the accurate evaluation of advanced aspects such as composition. We analyze this drawback and attribute it to a lack of attention to symbolic reasoning, even though GenFMs demonstrate strong discriminative abilities in handling symbolism. To this end, we propose a new paradigm for evaluating GenFMs' text-to-visual (T2V) generation that uses neuro-symbolic thinking to mitigate the long-tail effect. By explicitly embedding mixture-of-experts (MoE) large vision models (LVMs), we introduce symbolic-level understanding while maintaining strong neuro-level reasoning capability. Through the fusion of semantic and compositional knowledge at the neuro-to-symbolic level, our approach outperforms state-of-the-art T2V evaluation methods, exhibiting stronger compositional reasoning ability on Winoground and better alignment with human judgment. We also demonstrate its effectiveness on diverse tasks, including text-to-3D and text-to-video. To further advance the T2V evaluation of GenFMs, we propose a challenging benchmark that includes richer and more diverse compositional and semantic information than Winoground. Overall, our work opens a new direction for neuro-to-symbolic visio-linguistic evaluation of GenFMs and aims to drive further progress in the field.
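
For readers unfamiliar with the baselines named in the abstract, the following is a minimal sketch of how a CLIPScore-style alignment metric and the Winoground text, image, and group criteria are commonly computed. It is not the TlTScore method. The embeddings are assumed to come from some pretrained text and image encoder that is not shown here, the weight `w = 2.5` follows the usual CLIPScore convention, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as having no alignment
    return dot / (norm_u * norm_v)


def clip_score(text_emb: Sequence[float],
               image_emb: Sequence[float],
               w: float = 2.5) -> float:
    """CLIPScore-style metric: w * max(cos(text, image), 0).

    The embeddings are assumed to be produced by a CLIP-like encoder;
    obtaining them is outside the scope of this sketch.
    """
    return w * max(cosine(text_emb, image_emb), 0.0)


def winoground_scores(s: Sequence[Sequence[float]]) -> Dict[str, bool]:
    """Winoground-style criteria for one example.

    s[i][j] is the alignment score of caption i with image j, where
    caption 0 matches image 0 and caption 1 matches image 1.
    """
    # Text criterion: for each image, the matching caption scores higher.
    text_ok = s[0][0] > s[1][0] and s[1][1] > s[0][1]
    # Image criterion: for each caption, the matching image scores higher.
    image_ok = s[0][0] > s[0][1] and s[1][1] > s[1][0]
    return {"text": text_ok, "image": image_ok, "group": text_ok and image_ok}


if __name__ == "__main__":
    # Toy, hand-written score matrix (hypothetical values, not real results).
    print(winoground_scores([[0.9, 0.2], [0.3, 0.8]]))
```

Any metric that returns a scalar caption-image score can be dropped into the 2x2 matrix, which is why Winoground is a convenient test of compositional reasoning for evaluation methods like the one proposed here.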

Related Material

[bibtex]

@InProceedings{Ji_2024_CVPR,
    author    = {Ji, Pengliang and Liu, Junchen},
    title     = {TlTScore: Towards Long-Tail Effects in Text-to-Visual Evaluation with Generative Foundation Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {5302-5313}
}