-
[pdf]
[bibtex]@InProceedings{Bompilwar_2025_ICCV, author = {Bompilwar, Ritik and Koshatwar, Saurabh}, title = {QualiVision: Multi-Modal Video Quality Assessment with Quality-Aware Fusion and Discriminative Learning Strategies}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops}, month = {October}, year = {2025}, pages = {3469-3478} }
QualiVision: Multi-Modal Video Quality Assessment with Quality-Aware Fusion and Discriminative Learning Strategies
Abstract
The proliferation of AI-generated video content across media production, entertainment, and social platforms has created new challenges in automatic quality assessment. While traditional metrics focus primarily on technical artifacts, AI-generated content requires comprehensive evaluation across temporal consistency, image fidelity, aesthetic appeal, and text-video alignment to ensure semantic coherence between input prompts and visual outputs. We present QualiVision, a multi-modal video quality assessment framework designed to address these requirements. Working with the VQualA 2025 challenge dataset of 4,000 annotated AI-generated videos, we develop two complementary architectures that integrate video understanding with textual prompt analysis. Our first approach enhances DOVER++ with a quality-aware cross-modal fusion mechanism that dynamically adapts attention based on textual content. The second employs V-JEPA2 with strategic layer freezing and discriminative learning rates for efficient adaptation of large-scale transformers. Both models utilize a hybrid loss function combining smooth L1, ranking, and scale-aware components to better align with human perceptual judgments. Experimental evaluation of our V-JEPA2 architecture achieved a final score of 0.57, and DOVER++ obtained 0.39 on the challenge test set, demonstrating the effectiveness of cross-modal understanding for AI-generated video quality assessment.
Related Material
