Toward Automation in Text-based Video Retrieval with LLM Assistance
Abstract
Video retrieval is a challenging task that requires capturing both semantic relevance and temporal consistency to return accurate results. Recent advances in Vision-Language Models (VLMs) have significantly improved retrieval performance, but handling complex queries and ensuring robust ranking remain difficult. To address these issues, we explore integrating Large Language Models (LLMs) into multiple modules of a text-based video retrieval pipeline. Specifically, we propose four key modules: Temporal-assisted retrieval, Query refinement, Results reranking, and Multimodal combinations. These modules leverage LLMs to improve temporal understanding, refine input queries, rerank retrieval results, and integrate multimodal cues, ultimately improving the relevance and accuracy of retrieved video segments. Our study focuses on a fully automated retrieval system in which queries are processed without human intervention. Comprehensive experiments on the textual Known-Item Search (KIS) dataset from the Video Browser Showdown (VBS) competition demonstrate that LLM-assisted retrieval significantly improves performance: the proposed framework outperforms conventional approaches by handling complex search queries more effectively, highlighting the potential of LLMs in automated video search.
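The abstract describes four LLM-assisted modules but gives no implementation details. Below is a minimal Python sketch of how such a pipeline could be wired together; every function, the naive "then"-based temporal split, the random stand-in embeddings, and the uniform fusion weights are illustrative assumptions rather than the authors' actual system.

# Illustrative sketch of an LLM-assisted text-to-video retrieval pipeline.
# All module implementations are placeholders; the paper does not publish
# code, so function names and scoring logic here are hypothetical.

import numpy as np

def refine_query(query: str) -> str:
    """Query refinement: an LLM would rewrite the query into a cleaner,
    retrieval-friendly description. Stubbed as a pass-through here."""
    return query.strip()

def split_temporal(query: str) -> list[str]:
    """Temporal-assisted retrieval: an LLM would split a query describing
    a sequence of events into ordered sub-queries. Stubbed with a naive
    split on 'then' for illustration."""
    return [part.strip() for part in query.split("then") if part.strip()]

def embed_text(texts: list[str], dim: int = 512) -> np.ndarray:
    """Placeholder for a VLM text encoder (e.g. a CLIP-style model)."""
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    v = rng.normal(size=(len(texts), dim))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def retrieve(sub_queries: list[str], shot_embeddings: np.ndarray, top_k: int = 100):
    """Rank video shots by cosine similarity, averaged over the ordered
    sub-queries (a crude stand-in for temporal-consistency scoring)."""
    q = embed_text(sub_queries)           # (S, D)
    sims = q @ shot_embeddings.T          # (S, N)
    scores = sims.mean(axis=0)            # (N,)
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

def llm_rerank(candidates: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Results reranking: an LLM would judge each candidate's caption, OCR,
    or ASR text against the query; here we simply keep the retrieval order."""
    return candidates

def multimodal_combine(scores_by_modality: dict[str, np.ndarray]) -> np.ndarray:
    """Multimodal combination: fuse per-modality scores (visual, OCR, ASR, ...)
    with a simple uniform weighted sum."""
    weights = {m: 1.0 / len(scores_by_modality) for m in scores_by_modality}
    return sum(weights[m] * s for m, s in scores_by_modality.items())

if __name__ == "__main__":
    # Toy corpus of 1,000 pre-embedded video shots.
    shots = embed_text([f"shot-{i}" for i in range(1000)])
    query = "a man opens a red car door then drives away at night"

    refined = refine_query(query)
    sub_queries = split_temporal(refined)
    candidates, visual_scores = retrieve(sub_queries, shots)
    fused = multimodal_combine({"visual": visual_scores})
    ranked = llm_rerank(candidates, fused)
    print("Top-5 shot ids:", ranked[:5].tolist())

In a real system the stubs above would be replaced by calls to an actual LLM (for refinement, temporal splitting, and reranking) and a pretrained VLM encoder, with fusion weights tuned rather than uniform.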
Related Material

[pdf] [supp] [bibtex]

@InProceedings{Quan_2025_CVPR,
  author    = {Quan, Khanh-An C. and Nguyen, Qui Ngoc and Luu, Duc-Tuan},
  title     = {Toward Automation in Text-based Video Retrieval with LLM Assistance},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {3708-3716}
}