@InProceedings{Zhang_2025_CVPR,
    author    = {Zhang, Weixia and Zheng, Bingkun and Chen, Junlin and Wang, Zhihua},
    title     = {Multi-Dimensional Quality Assessment for UGC Videos via Modular Multi-Modal Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {1557-1566}
}
Multi-Dimensional Quality Assessment for UGC Videos via Modular Multi-Modal Vision-Language Models
Abstract
Recent advances in video processing and the growth of social media have led to a surge in user-generated content (UGC) videos. However, various factors can degrade their quality, underscoring the need for robust video quality assessment (VQA) models to optimize devices, monitor quality, and enhance recommendation systems. While current VQA models can accurately evaluate the overall quality of UGC videos, they do not offer fine-grained assessments, making it difficult to pinpoint the sources of quality issues. In this work, we introduce a VQA model that evaluates UGC videos along six quality dimensions: color, noise, artifacts, blur, temporal consistency, and overall quality. We formulate the multi-dimensional VQA task as modeling the joint distribution of all quality dimensions, encouraging our model to learn the intrinsic mechanisms by which different factors influence perceived video quality. We utilize emerging multi-modal vision-language models as the base quality evaluators, supplementing them with two additional modules that deliver complementary information to deepen video quality understanding. Special care is also taken to handle UGC videos with various aspect ratios, enabling us to process them at appropriate resolutions. Specifically, we adopt the NaFlex variant of the SigLIP-2 model, which adaptively resizes video frames based on their original resolutions and aspect ratios. We also employ a multi-modal large language model (MLLM), a variant of Q-Align, as an additional base quality predictor, which contributes further improvements through model ensembling in the final quality prediction. Experimental results show that the proposed model outperforms competing methods on the FineVQ dataset, confirming its effectiveness. The source code will be made publicly accessible.
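
The abstract gives no implementation details, but the multi-dimensional formulation can be illustrated with a short sketch. The PyTorch snippet below is ours, not the paper's: the names, feature size, and five-level rating scale are assumptions, and the per-dimension heads trained with a summed loss are only a simple stand-in for the richer joint-distribution modeling described above.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Assumed dimension names and rating scale; not taken from the paper.
    QUALITY_DIMENSIONS = ("color", "noise", "artifacts", "blur",
                          "temporal_consistency", "overall")
    NUM_LEVELS = 5  # assumed discrete rating levels, e.g. "bad" .. "excellent"

    class MultiDimensionalQualityHead(nn.Module):
        """Predicts a rating distribution for every quality dimension from one
        shared vision-language embedding, so all dimensions are trained jointly."""

        def __init__(self, feature_dim: int = 768):
            super().__init__()
            self.heads = nn.ModuleDict({
                name: nn.Linear(feature_dim, NUM_LEVELS)
                for name in QUALITY_DIMENSIONS
            })

        def forward(self, video_features: torch.Tensor) -> dict:
            # video_features: (batch, feature_dim) pooled embedding of a video.
            return {name: head(video_features)
                    for name, head in self.heads.items()}

    def joint_quality_loss(logits: dict, labels: dict) -> torch.Tensor:
        # One cross-entropy term per dimension, summed so each sample
        # supervises all six dimensions at once.
        return sum(F.cross_entropy(logits[name], labels[name])
                   for name in QUALITY_DIMENSIONS)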
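
The NaFlex-style preprocessing can likewise be sketched: choose a patch-aligned target resolution that preserves a frame's aspect ratio while staying within a patch budget, instead of forcing a fixed square input. The helper below is a hypothetical illustration of that idea, not the actual SigLIP-2 preprocessing code, and details such as rounding behaviour may differ.

    import math

    def naflex_style_target_size(height: int, width: int,
                                 patch_size: int = 16,
                                 max_patches: int = 1024) -> tuple[int, int]:
        """Return a patch-aligned (height, width) that keeps the frame's
        aspect ratio and uses at most `max_patches` patches."""
        # Scale so that (h*s/p) * (w*s/p) is roughly max_patches.
        scale = math.sqrt(max_patches * patch_size ** 2 / (height * width))
        scale = min(scale, 1.0)  # assumption: never upsample small frames

        # Round each side down to a multiple of the patch size (>= one patch).
        new_h = max(patch_size, int(height * scale) // patch_size * patch_size)
        new_w = max(patch_size, int(width * scale) // patch_size * patch_size)
        return new_h, new_w

    # Example: a 1920x1080 portrait frame with 16x16 patches and a budget of
    # 1024 patches maps to 672x384 (42x24 = 1008 patches), keeping the
    # original aspect ratio instead of stretching to a square input.
    print(naflex_style_target_size(1920, 1080))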