DCVQE: A Hierarchical Transformer for Video Quality Assessment

Zutong Li, Lei Yang; Proceedings of the Asian Conference on Computer Vision (ACCV), 2022, pp. 2562-2579


The explosion of user-generated videos has created great demand for no-reference video quality assessment (NR-VQA). Inspired by our observations of how human annotators rate videos, we put forward a Divide and Conquer Video Quality Estimator (DCVQE) for NR-VQA. Starting from extracted frame-level quality embeddings (QE), our proposal splits the whole sequence into a number of shots and applies Transformers to learn the shot-level QE and update the frame-level QE simultaneously; another Transformer then combines the shot-level QE to generate the video-level QE. We call this hierarchical combination of Transformers a Divide and Conquer Transformer (DCTr) layer. Repeating the DCTr layer several times yields a strong video quality feature extractor. In addition, taking the order relationship among the annotated data into account, we propose a novel correlation loss term for training. Experiments confirm that our DCVQE outperforms most other algorithms by a large margin.
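The data flow of one DCTr layer, as the abstract describes it, might be sketched roughly as follows. This is an illustrative toy, not the authors' implementation: the names `attend`, `split_into_shots`, `dctr_layer`, and `run_dcvqe_sketch` are hypothetical, shots are assumed to be fixed-length chunks, the shot and video tokens are assumed to start as simple means, and the attention uses identity projections with no learned weights.

```python
import math


def attend(tokens):
    """Single-head self-attention with identity projections (no learned weights)."""
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return outputs


def split_into_shots(frame_qe, shot_len):
    """Assumption: shots are fixed-length chunks; the last one may be shorter."""
    return [frame_qe[i:i + shot_len] for i in range(0, len(frame_qe), shot_len)]


def dctr_layer(frame_qe, video_qe, shot_len):
    """One divide-and-conquer pass; returns (updated frame QE, video QE)."""
    updated_frames, shot_qe = [], []
    for shot in split_into_shots(frame_qe, shot_len):
        # Divide: a shot token (assumed to start as the shot mean) attends
        # jointly with the frames, giving the shot-level QE and the updated
        # frame-level QE in the same pass.
        shot_token = [sum(col) / len(shot) for col in zip(*shot)]
        out = attend([shot_token] + shot)
        shot_qe.append(out[0])
        updated_frames.extend(out[1:])
    # Conquer: a second attention step combines the shot-level QE
    # into the video-level QE.
    new_video_qe = attend([video_qe] + shot_qe)[0]
    return updated_frames, new_video_qe


def run_dcvqe_sketch(frame_qe, shot_len, num_layers):
    """Repeat the layer several times, as the abstract describes."""
    video_qe = [sum(col) / len(frame_qe) for col in zip(*frame_qe)]
    for _ in range(num_layers):
        frame_qe, video_qe = dctr_layer(frame_qe, video_qe, shot_len)
    return frame_qe, video_qe
```

For example, `run_dcvqe_sketch([[float(i), 1.0] for i in range(10)], shot_len=4, num_layers=2)` processes ten two-dimensional frame embeddings as three shots and returns ten updated frame embeddings plus one video-level embedding. A real model would add learned projections, multiple heads, and a regression head that maps the video-level QE to a quality score.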

Related Material

@InProceedings{Li_2022_ACCV,
    author    = {Li, Zutong and Yang, Lei},
    title     = {DCVQE: A Hierarchical Transformer for Video Quality Assessment},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2022},
    pages     = {2562-2579}
}