Video Quality Assessment Based on Swin Transformer With Spatio-Temporal Feature Fusion and Data Augmentation
While video enhancement has drawn significant interest and has been studied extensively in both academia and industry, video quality assessment (VQA) for enhanced video has received comparatively little attention. Video enhancement methods typically alter attributes such as brightness, contrast, and color, which causes perceptual quality to fluctuate and makes the associated VQA task challenging. In this paper, we propose a novel VQA approach based on Swin Transformer with improved spatio-temporal feature fusion, which exploits stage-wise feature concatenation and delivers competitive assessment performance. In addition, we propose an efficient data augmentation strategy that improves data diversity and further enhances assessment accuracy. Experimental results show that the proposed approach achieves state-of-the-art performance on two benchmark VQA datasets and ranked first in the CVPR NTIRE 2023 Quality Assessment for Video Enhancement Challenge, indicating that it is not only promising for VQA of enhanced video but also applicable to general VQA tasks.
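The abstract does not specify how the stage-wise features are fused, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each backbone stage yields a per-frame feature map, that each map is reduced by spatial average pooling, that the pooled vectors are concatenated across stages, and that the result is averaged over time. The function names `fuse_stage_features` and `make_dummy_stages` are hypothetical, and random arrays stand in for real backbone outputs; the stage shapes mimic a Swin-T style hierarchy at 224x224 input, which is also an assumption.

```python
import numpy as np


def fuse_stage_features(stage_feats):
    """Fuse per-stage features into a single video-level descriptor.

    stage_feats: list of arrays shaped (T, H_s, W_s, C_s), one per stage,
    where T is the number of frames. This pooling/concatenation scheme is
    an illustrative assumption, not the method described in the paper.
    """
    pooled = []
    for feat in stage_feats:
        # Spatial global average pooling: (T, H_s, W_s, C_s) -> (T, C_s)
        pooled.append(feat.mean(axis=(1, 2)))
    # Stage-wise concatenation along the channel axis: (T, sum of C_s)
    per_frame = np.concatenate(pooled, axis=1)
    # Temporal average pooling: (sum of C_s,)
    return per_frame.mean(axis=0)


def make_dummy_stages(num_frames=4, seed=0):
    """Random stand-ins for backbone outputs (assumed Swin-T-like shapes)."""
    rng = np.random.default_rng(seed)
    shapes = [(56, 96), (28, 192), (14, 384), (7, 768)]  # (side, channels)
    return [rng.standard_normal((num_frames, s, s, c)) for s, c in shapes]


if __name__ == "__main__":
    fused = fuse_stage_features(make_dummy_stages())
    print(fused.shape)
```

In a full model, a descriptor of this kind would typically be passed to a small regression head that predicts the quality score; that head is omitted here because the abstract gives no details about it.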