Inflation With Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution

Xin Yuan, Jinoo Baek, Keyang Xu, Omer Tov, Hongliang Fei; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2024, pp. 489-496

Abstract


We propose an efficient diffusion-based tuning approach for text-to-video super-resolution (SR) that reuses the spatial modeling capacity an image diffusion model has already learned and applies it to video generation. To this end, we design an efficient architecture by inflating the weights of a text-to-image SR model into our video generation framework. We also incorporate a temporal adapter to ensure temporal coherence across video frames. We investigate different tuning approaches based on the inflated architecture and report the trade-offs between computational cost and super-resolution quality. Quantitative and qualitative evaluation on the Shutterstock video dataset demonstrates that our approach performs text-to-video SR generation with good visual quality and temporal consistency. To evaluate temporal coherence, we also provide video visualizations at [google drive link].
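The two ingredients named in the abstract, weight inflation and a temporal adapter, can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes inflation means applying a pretrained per-image layer to every frame with shared weights, and it replaces the temporal adapter with a simple neighbour-averaging residual controlled by a hypothetical `gate` parameter. The names `inflate`, `temporal_adapter`, and `gate` are illustrative only.

```python
from typing import Callable, List

Frame = List[float]  # a flattened frame; purely illustrative


def inflate(spatial_layer: Callable[[Frame], Frame]) -> Callable[[List[Frame]], List[Frame]]:
    """Turn a per-image layer into a per-video layer by sharing it across frames.

    Assumption: the pretrained image weights are reused unchanged for each frame.
    """
    def video_layer(frames: List[Frame]) -> List[Frame]:
        return [spatial_layer(f) for f in frames]
    return video_layer


def temporal_adapter(frames: List[Frame], gate: float) -> List[Frame]:
    """Residual temporal mixing (a stand-in for a learned temporal adapter).

    Each frame is blended with the mean of itself and its two temporal
    neighbours (edges are clamped). With gate = 0 the input is returned
    unchanged, so the inflated model starts out behaving like the image model.
    """
    last = len(frames) - 1
    out = []
    for t, cur in enumerate(frames):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, last)]
        mixed = [(p + c + n) / 3.0 for p, c, n in zip(prev, cur, nxt)]
        out.append([c + gate * (m - c) for c, m in zip(cur, mixed)])
    return out


# Hypothetical "pretrained" image layer: doubles every value.
image_layer = lambda frame: [2.0 * x for x in frame]
video_layer = inflate(image_layer)

video = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
per_frame = video_layer(video)
unchanged = temporal_adapter(per_frame, gate=0.0)
smoothed = temporal_adapter(per_frame, gate=1.0)
```

In this sketch, a gate of zero leaves the per-frame output untouched, which is one common way to add temporal layers without disturbing a pretrained spatial model at the start of tuning. Whether the paper initializes its adapter this way is not stated in the abstract.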

Related Material


@InProceedings{Yuan_2024_WACV,
    author    = {Yuan, Xin and Baek, Jinoo and Xu, Keyang and Tov, Omer and Fei, Hongliang},
    title     = {Inflation With Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {January},
    year      = {2024},
    pages     = {489-496}
}