VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary
Abstract
Human daily activities can be concisely narrated as sequences of routine events (e.g., turning off an alarm) in video streams, forming an event vocabulary. Motivated by this, we introduce **VLog**, a novel video understanding framework that defines video narrations as a vocabulary, going beyond the typical subword vocabularies in existing generative video-language models. Built on the lightweight language model GPT-2, **VLog** features three key innovations:

1. **A Generative Retrieval Model**, marrying the language model's complex reasoning capabilities with contrastive retrieval's efficient similarity search.
2. **A Hierarchical Vocabulary**, derived from large-scale video narrations using our narration pair encoding algorithm, enabling efficient indexing of specific events (e.g., cutting a tomato) by identifying broader scenarios (e.g., kitchen) with expressive postfixes (e.g., by the left hand).
3. **A Vocabulary Update Strategy**, leveraging generative models to extend the vocabulary for novel events encountered during inference.

To validate our approach, we introduce **VidCap-Eval**, a development set requiring concise narrations with reasoning relationships (e.g., before and after). Experiments on **EgoSchema**, **COIN**, and **HiREST** further demonstrate the effectiveness of **VLog**, highlighting its ability to generate concise, contextually accurate, and efficient narrations. This offers a novel perspective on video understanding.
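As a rough illustration of the mechanism the abstract describes, the sketch below shows what generative retrieval over a hierarchical narration vocabulary might look like: a language-model state is scored against embeddings of whole narration entries rather than decoded into subword tokens, retrieval can be restricted to a broad scenario, and the vocabulary can be extended at inference time. The toy vocabulary, the `retrieve_narration` and `add_novel_event` helpers, and all embeddings are hypothetical stand-ins under these assumptions, not the released VLog code (which builds on GPT-2 and the paper's narration pair encoding).

```python
# Minimal, self-contained sketch (not the paper's implementation) of generative
# retrieval over a narration vocabulary: a language-model hidden state is scored
# against embeddings of whole narration entries, and the best-matching entry is
# emitted. Vocabulary, embeddings, dimensions, and helpers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # assumed width shared by the LM state and the vocabulary embeddings

# Hierarchical narration vocabulary: broad scenario -> specific events.
narration_vocab = {
    "kitchen": ["cut a tomato", "wash the pan", "turn on the stove"],
    "bedroom": ["turn off an alarm", "make the bed"],
}

# Flatten to (scenario, event) entries; give each a unit-norm embedding.
entries = [(s, e) for s, events in narration_vocab.items() for e in events]
entry_emb = rng.standard_normal((len(entries), DIM))
entry_emb /= np.linalg.norm(entry_emb, axis=1, keepdims=True)


def lm_hidden_state(video_context: str) -> np.ndarray:
    """Placeholder for the hidden state a language model (e.g., GPT-2) would
    produce from the video/context tokens; here it is just a random unit vector
    so the example runs without any model weights."""
    h = rng.standard_normal(DIM)
    return h / np.linalg.norm(h)


def retrieve_narration(video_context: str, scenario: str | None = None) -> tuple[str, str]:
    """Generative-retrieval step: score every vocabulary entry by cosine
    similarity with the LM state, optionally restricted to one scenario
    (the coarse level of the hierarchy), and return the best (scenario, event)."""
    h = lm_hidden_state(video_context)
    allowed = np.array([scenario is None or s == scenario for s, _ in entries])
    scores = entry_emb @ h
    scores[~allowed] = -np.inf
    return entries[int(np.argmax(scores))]


def add_novel_event(scenario: str, event: str, emb: np.ndarray) -> None:
    """Vocabulary update: append a newly generated narration so that later
    retrieval steps can index it directly."""
    global entry_emb
    narration_vocab.setdefault(scenario, []).append(event)
    entries.append((scenario, event))
    entry_emb = np.vstack([entry_emb, emb / np.linalg.norm(emb)])


if __name__ == "__main__":
    print(retrieve_narration("hands reach toward the cutting board", scenario="kitchen"))
    add_novel_event("kitchen", "peel a carrot", rng.standard_normal(DIM))
    print(retrieve_narration("a peeler scrapes along a carrot"))
```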
Related Material

[pdf] [supp] [arXiv] [bibtex]

@InProceedings{Lin_2025_CVPR,
  author    = {Lin, Kevin Qinghong and Shou, Mike Zheng},
  title     = {VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {3218-3228}
}