T2VBench: Benchmarking Temporal Dynamics for Text-to-Video Generation

Pengliang Ji, Chuyang Xiao, Huilin Tai, Mingxiao Huo; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 5325-5335

Abstract


While text-to-video (T2V) generative models produce exceptionally realistic videos, they lack comprehensive evaluation across the temporal dimension, with existing work limited to basic dynamics such as camera transitions, movement, and event sequences. In this work, we introduce T2VBench, a comprehensive T2V evaluation benchmark enriched with temporal dynamics lexicons derived from curated temporal words on Wikipedia. T2VBench is a hierarchical evaluation framework comprising over 1600 temporally rich prompts and 5000 generated videos with human ratings spanning 16 critical temporal evaluation dimensions. We assess three leading text-to-video models, including ZeroScope and Pika, to gauge their proficiency in handling temporal dynamics. Our analysis highlights the strengths and limitations of these models across various temporal aspects. Furthermore, we provide insights into future directions for enhancing text-to-video evaluation metrics and offer a detailed analysis of these models' performance across the temporal dimensions. Overall, T2VBench is the first comprehensive benchmark fully focused on temporal dynamics for text-to-video evaluation. It aims to facilitate scientific benchmarking of both generative models and automated metrics for text-to-video generation.
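To make the evaluation setup concrete, the sketch below shows one way per-dimension scores could be aggregated from human ratings in a benchmark of this kind. It is a minimal illustration only: the record layout, field names, dimension labels, and rating scale are hypothetical assumptions, not T2VBench's actual data format or the paper's scoring protocol.

# Minimal sketch: averaging human ratings per (model, temporal dimension).
# All field names, dimension labels, and scores below are hypothetical
# illustrations; T2VBench defines 16 temporal dimensions, but its actual
# data format and scoring protocol are specified in the paper, not here.
from collections import defaultdict
from statistics import mean

# Hypothetical records: one generated video, rated by humans on the
# temporal dimension its prompt targets.
ratings = [
    {"model": "ZeroScope", "dimension": "camera transition", "score": 3.2},
    {"model": "ZeroScope", "dimension": "event order",       "score": 2.8},
    {"model": "Pika",      "dimension": "camera transition", "score": 4.1},
    {"model": "Pika",      "dimension": "event order",       "score": 3.5},
]

def per_dimension_scores(records):
    """Average human scores for each (model, dimension) pair."""
    buckets = defaultdict(list)
    for r in records:
        buckets[(r["model"], r["dimension"])].append(r["score"])
    return {key: mean(vals) for key, vals in buckets.items()}

for (model, dim), score in sorted(per_dimension_scores(ratings).items()):
    print(f"{model:9s} | {dim:17s} | {score:.2f}")

Grouping by (model, dimension) pairs like this yields the kind of per-dimension breakdown the paper uses to compare models' strengths and weaknesses across temporal aspects.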

Related Material


[bibtex]
@InProceedings{Ji_2024_CVPR,
    author    = {Ji, Pengliang and Xiao, Chuyang and Tai, Huilin and Huo, Mingxiao},
    title     = {T2VBench: Benchmarking Temporal Dynamics for Text-to-Video Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {5325-5335}
}