Enhancing Video Vision Language Model with Hippocampal Sensing

Cao, Xu

Xu Cao; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 33682-33692

Abstract

Current video vision language models (VLMs) process information passively, lacking the ability to dynamically plan their analysis or to reason jointly across crucial modalities such as video and audio. To address this, we introduce Hippocampal Sensing (HippoSense), a learning paradigm inspired by the hippocampus that shifts the focus from temporal predictive sensing to cross-modal predictive sensing. The core objective of HippoSense is to train the model to anticipate the audio-caption summary of the current state from video, and vice versa. We present HippoVLM, a video VLM that operationalizes this paradigm. Instead of passively ingesting all data, HippoVLM actively reasons about its information needs using Chain-of-Thought (CoT). Our training process is twofold: we first finetune HippoVLM with HippoSense, and then apply a novel contrastive Reinforcement Learning (RL) algorithm, Video-Audio Negative-aware Optimization (VANAO), to optimize this selective co-reasoning process. This approach proves highly effective: despite its significantly smaller size, HippoVLM achieves performance competitive with massive MLLMs like GPT-4o and Gemini 1.5 Pro on multiple video VQA benchmarks.

Related Material

[pdf]

[bibtex]

@InProceedings{Cao_2026_CVPR,
    author    = {Cao, Xu},
    title     = {Enhancing Video Vision Language Model with Hippocampal Sensing},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {33682-33692}
}