CadenceRAG: Context-Aware and Dependency-Enhanced Retrieval Augmented Generation for Holistic Video Understanding

Liu, Heng; Jiang, Siru; Duan, Fangyun; Lyu, Yongzhe; Wang, Xiusong; Ge, Hanlin; Liang, Chao

Heng Liu, Siru Jiang, Fangyun Duan, Yongzhe Lyu, Xiusong Wang, Hanlin Ge, Chao Liang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 3718-3727

Abstract

This paper addresses the challenging problem of holistic video understanding, focusing on rich-text-based video retrieval and question answering. Compared to simple video retrieval tasks with concise queries, complex multi-scene queries face two major challenges: (1) the rich semantics in text queries often depict complex multi-scene narratives with fine-grained details, making it difficult to align them with target video segments both locally and globally; (2) relevant segments, though spanning multiple scenes, are still a minuscule fraction of the vast video corpus, making precise localization extremely challenging. To address these challenges, we propose a novel approach, CadenceRAG, which introduces a unified Retrieval-Augmented Generation (RAG) framework for Known-Item Search (KIS) and Question Answering (QA). By strategically decomposing rich textual queries into temporally ordered sub-queries and employing a Hierarchical Sliding Window (HSW), our method precisely aligns and locates relevant video segments. The retrieved segments, along with their associated multimodal information, are integrated into the RAG framework to enhance contextual grounding and generate accurate, context-aware results. We evaluated our method in the IViSE competition, achieving a score of 9.5/10 in KIS and 8/10 in QA, demonstrating strong performance in both tasks.
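The abstract does not spell out how the temporally ordered sub-queries and the Hierarchical Sliding Window interact, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the sub-queries have already been produced and scored against every video segment, giving a matrix `sim[i][j]` (sub-query `i` versus segment `j`). The names `window_score` and `hierarchical_sliding_window`, the two-level coarse-to-fine search, and the default stride are all hypothetical.

```python
from typing import List, Optional, Tuple


def window_score(sim: List[List[float]], start: int, size: int) -> float:
    """Best total similarity when every sub-query is matched to a distinct
    segment of the window, in temporal order (assumed scoring rule)."""
    n_seg = min(size, len(sim[0]) - start)
    neg = float("-inf")
    # prev[j]: best score for the sub-queries handled so far,
    # using only the first j segments of the window.
    prev = [0.0] * (n_seg + 1)
    for row in sim:
        cur = [neg] * (n_seg + 1)
        for j in range(1, n_seg + 1):
            take = prev[j - 1] + row[start + j - 1]  # match this sub-query to segment j
            cur[j] = max(cur[j - 1], take)           # or leave segment j unused
        prev = cur
    return prev[n_seg]  # -inf if the window has fewer segments than sub-queries


def hierarchical_sliding_window(
    sim: List[List[float]],
    coarse_size: int,
    fine_size: int,
    coarse_stride: Optional[int] = None,
) -> Tuple[int, int]:
    """Coarse pass with large strided windows, then a fine pass with
    stride 1 inside the best coarse window. Returns [start, end) indices."""
    n = len(sim[0])
    stride = coarse_stride or max(1, coarse_size // 2)  # assumed default
    best_coarse = max(range(0, n, stride),
                      key=lambda s: window_score(sim, s, coarse_size))
    end = min(best_coarse + coarse_size, n)
    fine_starts = range(best_coarse, max(best_coarse, end - fine_size) + 1)
    best_fine = max(fine_starts, key=lambda s: window_score(sim, s, fine_size))
    return best_fine, min(best_fine + fine_size, n)


# Toy data: 3 sub-queries, 12 segments, true matches at segments 6, 7, 8.
sim = [[0.1] * 12 for _ in range(3)]
sim[0][6] = sim[1][7] = sim[2][8] = 0.9
print(hierarchical_sliding_window(sim, coarse_size=6, fine_size=3))  # (6, 9)
```

The coarse pass keeps the search cheap over a large corpus, while the order-preserving score rejects windows in which the sub-queries match only out of sequence. In a RAG setting, the returned segment range would then be passed to the generator together with its associated multimodal context.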

Related Material

[bibtex]

@InProceedings{Liu_2025_CVPR,
    author    = {Liu, Heng and Jiang, Siru and Duan, Fangyun and Lyu, Yongzhe and Wang, Xiusong and Ge, Hanlin and Liang, Chao},
    title     = {CadenceRAG: Context-Aware and Dependency-Enhanced Retrieval Augmented Generation for Holistic Video Understanding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {3718-3727}
}