MeToM: Metadata-Guided Token Merging for Efficient Video LLMs

Zhuojie Wu, Shijie Wang, Xin Yu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 10441-10450

Abstract


Video Large Language Models (VLLMs) encounter significant computational challenges due to the large volume of visual tokens generated from multiple frames. Existing visual token pruning methods fail to account for uneven spatiotemporal information density, and thus squander scarce token budgets on regions with low information density. In this paper, we propose a training-free Metadata-guided Token Merging framework (MeToM) that leverages intrinsic video metadata to adaptively allocate budgets and merge visual tokens based on content complexity. Specifically, MeToM exploits residual data from codec metadata as spatial information density cues. It merges less informative regions during tokenization, avoiding redundant encoding and improving the efficiency of the visual encoder. Additionally, MeToM captures temporal variations in information density by using the average Group of Pictures (GoP) packet size to represent scene complexity. This mechanism enables dynamic per-frame token allocation across time, assigning more tokens to content-complex frames and fewer tokens to information-sparse ones. Finally, we merge low-contribution visual tokens via multi-layer attention to reduce prefill FLOPs and the visual KV-cache footprint inside the LLM. Extensive experimental results demonstrate that MeToM outperforms prior state-of-the-art methods. Notably, it achieves a 2.65x inference speedup over the baseline VLLM without sacrificing accuracy.
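To make the temporal allocation idea concrete, the following is a minimal sketch of how a fixed token budget could be split across frames using a per-frame complexity proxy such as the average GoP packet size. The abstract does not state the allocation rule, so the proportional split, the per-frame minimum, and the largest-remainder rounding are all assumptions chosen for illustration; the function name `allocate_token_budget` and its parameters are hypothetical and do not come from the paper.

```python
from typing import List


def allocate_token_budget(
    gop_sizes: List[float],
    total_budget: int,
    min_tokens: int = 1,
) -> List[int]:
    """Split total_budget across frames in proportion to gop_sizes.

    gop_sizes[i] is an assumed complexity proxy for frame i (for example,
    the average packet size of the GoP that contains the frame).
    Proportional allocation is an illustrative assumption, not the
    paper's stated rule.
    """
    n = len(gop_sizes)
    if n == 0:
        return []
    if any(size < 0 for size in gop_sizes):
        raise ValueError("gop_sizes must be non-negative")
    if total_budget < n * min_tokens:
        raise ValueError("total_budget is too small for the per-frame minimum")

    # Reserve the minimum for every frame, then share out what is left.
    spare = total_budget - n * min_tokens
    total_size = sum(gop_sizes)
    if total_size <= 0:
        shares = [spare / n] * n  # no signal: fall back to a uniform split
    else:
        shares = [spare * size / total_size for size in gop_sizes]

    budgets = [min_tokens + int(share) for share in shares]

    # Largest-remainder rounding so the budgets sum exactly to total_budget.
    leftover = total_budget - sum(budgets)
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[:leftover]:
        budgets[i] += 1
    return budgets
```

Under these assumptions, a call such as `allocate_token_budget([100, 300, 600], total_budget=100, min_tokens=10)` gives the frame with the largest packet size the largest share while the budgets still sum to 100. A real system would likely smooth or clip the proxy before allocating.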

Related Material


[bibtex]
@InProceedings{Wu_2026_CVPR,
    author    = {Wu, Zhuojie and Wang, Shijie and Yu, Xin},
    title     = {MeToM: Metadata-Guided Token Merging for Efficient Video LLMs},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {10441-10450}
}