DCVQE: A Hierarchical Transformer for Video Quality Assessment
The explosion of user-generated videos has created strong demand for no-reference video quality assessment (NR-VQA). Inspired by our observations of human annotation behavior, we propose the Divide and Conquer Video Quality Estimator (DCVQE) for NR-VQA. After extracting frame-level quality embeddings (QE), our method splits the whole sequence into a number of shots and applies Transformers to learn the shot-level QE and update the frame-level QE simultaneously; another Transformer then combines the shot-level QE to generate the video-level QE. We call this hierarchical combination of Transformers a Divide and Conquer Transformer (DCTr) layer. Repeating the DCTr layer several times yields an effective video quality feature extractor. In addition, to take the order relationship among the annotated data into account, we propose a novel correlation loss term for training. Experiments confirm that DCVQE outperforms most other algorithms by a large margin.
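
The divide-and-conquer data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes fixed-length shots, and it replaces each Transformer with a single parameter-free scaled dot-product attention step, with no learned projections, feed-forward blocks, or residual connections. The names `attend`, `split_into_shots`, `dctr_layer`, `shot_token`, and `video_token` are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(query, keys):
    """Scaled dot-product attention for one query; keys double as values.

    Assumption: a stand-in for a full Transformer block.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    return [sum(w * k[d] for w, k in zip(weights, keys))
            for d in range(len(query))]


def split_into_shots(frames, shot_len):
    """Divide step. Fixed-length shots are an assumption of this sketch."""
    return [frames[i:i + shot_len] for i in range(0, len(frames), shot_len)]


def dctr_layer(frame_qe, shot_len, shot_token, video_token):
    """One hierarchical layer: frames -> shot-level QE -> video-level QE."""
    shot_qe, new_frame_qe = [], []
    for shot in split_into_shots(frame_qe, shot_len):
        tokens = [shot_token] + shot
        # Every token attends to all tokens in its shot, so the shot-level
        # QE is learned and the frame-level QE is updated in the same pass.
        updated = [attend(t, tokens) for t in tokens]
        shot_qe.append(updated[0])
        new_frame_qe.extend(updated[1:])
    # Conquer step: a second attention pass combines the shot-level QE.
    video_qe = attend(video_token, [video_token] + shot_qe)
    return new_frame_qe, shot_qe, video_qe


random.seed(0)
dim = 4
frames = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(10)]
shot_token = [0.0] * dim   # would be a learned embedding in a real model
video_token = [0.0] * dim  # would be a learned embedding in a real model
for _ in range(2):         # repeat the layer, feeding updated frame QE forward
    frames, shot_qe, video_qe = dctr_layer(frames, 4, shot_token, video_token)
print(len(frames), len(shot_qe), len(video_qe))  # 10 3 4
```

Ten frames with a shot length of 4 give three shots (4, 4, and 2 frames), so each layer returns ten updated frame-level QE, three shot-level QE, and one video-level QE. In a trained model, the video-level QE would be passed to a regression head that predicts the quality score; that head and the correlation loss are omitted here because the abstract does not specify their form.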