@InProceedings{Cokmez_2025_WACV,
  author    = {\c{C}\"okmez, Goksel Mert and Zhang, Yang and Schroers, Christopher and Aydin, Tun\c{c} Ozan},
  title     = {CLIP-Fusion: A Spatio-Temporal Quality Metric for Frame Interpolation},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {7450-7459}
}
CLIP-Fusion: A Spatio-Temporal Quality Metric for Frame Interpolation
Abstract
Video frame interpolation is an ill-posed problem, and a wide variety of methods have been proposed, ranging from traditional computer vision strategies to recent neural network models. While many methods exist to interpolate video frames, assessing the quality of the artifacts these methods produce still depends on off-the-shelf metrics. Although such metrics make accurate quality predictions for many visual artifacts, such as compression, blurring, and banding, their performance is mediocre on video frame interpolation artifacts due to the unique spatio-temporal characteristics of those artifacts. To address this, we leverage the semantic feature extraction capabilities of the pre-trained visual backbone of CLIP. Specifically, we adapt its multi-scale approach to our feature extraction network and combine it with the spatio-temporal attention mechanism of the Video Swin Transformer. This allows our model to detect interpolation-related artifacts across frames and predict the relevant differential mean opinion score. Our model outperforms existing state-of-the-art quality metrics for assessing the quality of interpolated frames in both full-reference and no-reference settings.
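To make the described architecture concrete, the following is a minimal, hypothetical PyTorch sketch of the overall idea: multi-scale features extracted per frame (a small convolutional stack stands in for CLIP's frozen visual backbone), fused by self-attention over tokens from all frames (standing in for the Video Swin Transformer's spatio-temporal attention), followed by a regression head that predicts a differential mean opinion score. All module names, layer sizes, and the token pooling scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatioTemporalQualityHead(nn.Module):
    # Illustrative sketch only; CLIP and Video Swin are replaced by stand-ins.
    def __init__(self, feat_dim=64, num_heads=4):
        super().__init__()
        # stand-in multi-scale feature extractor (the real model would use
        # intermediate activations of CLIP's frozen visual backbone)
        self.scale1 = nn.Conv2d(3, feat_dim, kernel_size=3, stride=2, padding=1)
        self.scale2 = nn.Conv2d(feat_dim, feat_dim, kernel_size=3, stride=2, padding=1)
        # stand-in spatio-temporal attention over tokens from all frames
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.head = nn.Linear(feat_dim, 1)  # DMOS regression head

    def forward(self, frames):  # frames: (B, T, 3, H, W)
        b, t, c, h, w = frames.shape
        x = frames.reshape(b * t, c, h, w)
        f1 = torch.relu(self.scale1(x))
        f2 = torch.relu(self.scale2(f1))
        # pool each scale into one token per frame, concatenate across time
        tok1 = f1.mean(dim=(2, 3))
        tok2 = f2.mean(dim=(2, 3))
        tokens = torch.stack([tok1, tok2], dim=1).reshape(b, t * 2, -1)
        # attention mixes information across frames and scales
        fused, _ = self.attn(tokens, tokens, tokens)
        return self.head(fused.mean(dim=1)).squeeze(-1)  # one score per clip

model = SpatioTemporalQualityHead()
clip_frames = torch.randn(2, 5, 3, 64, 64)  # batch of 2 clips, 5 frames each
scores = model(clip_frames)
print(scores.shape)  # one quality score per clip: torch.Size([2])
```

The key design point mirrored here is that quality is predicted from a joint representation of all frames, so temporally inconsistent artifacts, which per-frame metrics miss, can influence the score.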
Related Material