Re-thinking Temporal Search for Long-Form Video Understanding
Abstract
Efficient understanding of long-form videos remains a significant challenge in computer vision. In this work, we revisit temporal search paradigms for long-form video understanding and study a fundamental issue pertaining to all state-of-the-art (SOTA) long-context vision-language models (VLMs). Our contributions are two-fold. **First**, we formulate temporal search as a **Long Video Haystack** problem, i.e., finding a minimal set of relevant frames (typically one to five) among tens of thousands of frames from real-world long videos, given specific queries. To validate this formulation, we create **LV-Haystack**, the first benchmark containing 3,874 human-annotated instances with fine-grained evaluation metrics for assessing both keyframe search quality and computational efficiency. Experimental results on LV-Haystack highlight a significant research gap in temporal search capabilities, with SOTA keyframe selection methods achieving only a 2.1% temporal F1 score on the LVBench subset. **Next**, inspired by visual search in images, we re-think temporal search and propose a lightweight keyframe search framework, T^*, which casts the expensive temporal search as a spatial search problem. T^* leverages the superior visual localization capabilities typically used on images and introduces an adaptive zooming-in mechanism that operates across both temporal and spatial dimensions. Our extensive experiments show that, when integrated with existing methods, T^* significantly improves SOTA long-form video understanding performance. Specifically, under an inference budget of 32 frames, T^* improves GPT-4o's performance from 50.5% to **53.1%** and LLaVA-OneVision-72B's performance from 56.5% to **62.4%** on the LongVideoBench XL subset. Our PyTorch code, benchmark dataset, and models are included in the supplementary material.
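For intuition, the snippet below is a minimal, hypothetical sketch of the coarse-to-fine "zooming-in" idea described in the abstract: a query-conditioned frame scorer (here a stand-in callable, `frame_relevance`) is applied to a coarse uniform sample of the video, and sampling is then iteratively densified around the highest-scoring frames. It illustrates only the temporal part of such a search loop and is not the released T^* implementation.

```python
# Minimal illustrative sketch of a coarse-to-fine ("zoom-in") temporal search loop,
# assuming a generic query-conditioned frame scorer. This is NOT the released T^* code.
from typing import Callable, List

import numpy as np


def zoom_in_keyframe_search(
    num_frames: int,
    frame_relevance: Callable[[List[int]], np.ndarray],
    budget: int = 32,       # frames scored per refinement round (inference budget)
    top_k: int = 5,         # size of the keyframe set to return
    iterations: int = 3,    # number of coarse-to-fine rounds
) -> List[int]:
    """Iteratively narrow the temporal window around high-scoring frames."""
    # Start from a uniform, coarse sampling of the whole video.
    candidates = np.linspace(0, num_frames - 1, budget, dtype=int).tolist()
    window = max(1, num_frames // budget)  # spacing of the initial coarse grid
    for _ in range(iterations):
        scores = frame_relevance(candidates)
        # Keep the most promising frames and densify sampling around each of them.
        keep = [candidates[i] for i in np.argsort(scores)[::-1][:top_k]]
        refined = set(keep)
        for frame in keep:
            lo = max(0, frame - window)
            hi = min(num_frames - 1, frame + window)
            refined.update(np.linspace(lo, hi, budget // top_k, dtype=int).tolist())
        candidates = sorted(refined)
        window = max(1, window // 2)  # zoom in: halve the local window each round
    final_scores = frame_relevance(candidates)
    return sorted(candidates[i] for i in np.argsort(final_scores)[::-1][:top_k])


if __name__ == "__main__":
    # Toy usage: relevance peaks around frame 12,000 of a 40,000-frame video.
    rng = np.random.default_rng(0)

    def toy_scorer(frames: List[int]) -> np.ndarray:
        f = np.asarray(frames, dtype=float)
        return np.exp(-((f - 12_000) / 800.0) ** 2) + 0.01 * rng.random(len(f))

    print(zoom_in_keyframe_search(40_000, toy_scorer, budget=32, top_k=5))
```

In the paper's setting, the scoring step would correspond to a VLM-based visual localization pass over sampled frames; the Gaussian toy scorer above is used only to make the sketch runnable.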
Related Material
[pdf] [supp] [arXiv] [bibtex]

@InProceedings{Ye_2025_CVPR,
  author    = {Ye, Jinhui and Wang, Zihan and Sun, Haosen and Chandrasegaran, Keshigeyan and Durante, Zane and Eyzaguirre, Cristobal and Bisk, Yonatan and Niebles, Juan Carlos and Adeli, Ehsan and Fei-Fei, Li and Wu, Jiajun and Li, Manling},
  title     = {Re-thinking Temporal Search for Long-Form Video Understanding},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {8579-8591}
}