@InProceedings{Liu_2025_CVPR,
  author    = {Liu, Zhuoming and Li, Yiquan and Nguyen, Khoi Duc and Zhong, Yiwu and Li, Yin},
  title     = {PAVE: Patching and Adapting Video Large Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {3306-3317}
}
PAVE: Patching and Adapting Video Large Language Models
Abstract
We present PAVE, a framework for adapting pre-trained video large language models (Video-LLMs) to downstream tasks that incorporate side-channel signals, such as audio, camera pose, or high-frame-rate video. PAVE introduces a lightweight adaptation strategy called "patching", which adds a small number of parameters and operations to the base model without modifying its architecture or pre-trained weights. We demonstrate that PAVE effectively enhances pre-trained Video-LLMs at the cost of fewer than 1% additional FLOPs and parameters, across diverse tasks including audio-visual understanding, 3D reasoning, and multi-view video understanding, surpassing state-of-the-art task-specific models. Moreover, when applied to high-frame-rate videos, PAVE further improves video understanding, boosting the performance of already strong base models. Finally, our experiments show that the framework generalizes well across different Video-LLMs.
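The abstract does not spell out how a "patch" is wired into a frozen base model, so the following is only a minimal sketch of the general idea, under assumptions of ours: a tiny residual cross-attention adapter (all class names, shapes, and projections here are illustrative, not the paper's actual design), in which base-model tokens attend to side-channel tokens and the result is added back residually, leaving the frozen representation intact when the adapter contributes little.

```python
import numpy as np

# Hedged sketch of "patching": fuse side-channel tokens (e.g. audio or
# camera-pose features) into frozen Video-LLM features via a small
# residual cross-attention adapter. Names and shapes are assumptions,
# not the paper's API.

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class PatchAdapter:
    """Small learnable patch; the base model's own weights stay frozen."""
    def __init__(self, d_model, d_side, d_adapter=8):
        # Low-rank projections keep the added parameter count tiny
        # relative to a typical Video-LLM (well under 1%).
        self.q = rng.normal(0.0, 0.02, (d_model, d_adapter))
        self.k = rng.normal(0.0, 0.02, (d_side, d_adapter))
        self.v = rng.normal(0.0, 0.02, (d_side, d_model))

    def __call__(self, base_tokens, side_tokens):
        # Cross-attention: base tokens query the side-channel tokens.
        scores = base_tokens @ self.q @ (side_tokens @ self.k).T
        attn = softmax(scores / np.sqrt(self.q.shape[1]))
        # Residual add: if the adapter output is zero, the base
        # representation passes through unchanged.
        return base_tokens + attn @ (side_tokens @ self.v)

base = rng.normal(size=(16, 64))  # e.g. visual tokens from the frozen model
side = rng.normal(size=(4, 32))   # e.g. side-channel (audio/pose) tokens
out = PatchAdapter(64, 32)(base, side)
print(out.shape)  # (16, 64)
```

Because the fusion is residual, a freshly initialized (near-zero) patch barely perturbs the base model's behavior, which is one plausible reason such an adapter can be trained on a downstream task without disturbing pre-trained capabilities.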