Myopia Rectification: KV Cache Pruning for MLLMs Via Dynamic Attention Subsidy and Token Reclamation

Jiedong Zhuang, Lu Lu, Ming Dai, Jian Chen, Qiang Liu, Haoji Hu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026, pp. 9023-9033

Abstract


The computational demands of visual tokens in Multimodal Large Language Models (MLLMs) present a significant challenge for real-time inference, particularly in long-context scenarios. The key-value (KV) cache generated by these visual tokens consumes substantial GPU memory and slows decoding. While existing KV cache compression methods have made progress in language-only and single-image multimodal contexts, video-based and multi-image scenarios remain largely unexplored. To bridge this gap, we conduct an in-depth analysis of the attention mechanism in MLLMs and identify a phenomenon we term "Attention Myopia": the model allocates inadequate attention to later images, which impedes the direct application of existing compression methods to multi-image contexts. Based on this insight, we propose Dynamic Attention Subsidy (DAS), a novel technique that adjusts attention scores to ensure a more balanced selection of critical tokens. Additionally, to preserve the integrity of visual information, we propose a reclamation mechanism called "Recycling Bin", which compresses the discarded tokens to a minimal size. Our method achieves roughly a 1.5x speedup in the decoding phase while using 20% of the original model's KV cache footprint, and it maintains consistent performance across more than 20 diverse datasets. Crucially, our technique is plug-and-play, allowing seamless integration with existing pre-trained MLLMs without incurring any additional training costs.
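The abstract does not give the subsidy formula or the compression rule, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical subsidy that rescales per-token attention scores so every image has the same mean score, and a hypothetical "recycling bin" that averages the value vectors of discarded tokens per image. The names `subsidize`, `prune_with_recycling`, and `keep_ratio` are illustrative and do not come from the paper.

```python
from collections import defaultdict


def subsidize(scores, image_ids):
    """Rescale scores so every image has the same mean score (assumed rule)."""
    totals, counts = defaultdict(float), defaultdict(int)
    for s, img in zip(scores, image_ids):
        totals[img] += s
        counts[img] += 1
    global_mean = sum(scores) / len(scores)
    adjusted = []
    for s, img in zip(scores, image_ids):
        image_mean = totals[img] / counts[img]
        # Images that received little attention get a proportionally larger boost.
        adjusted.append(s * global_mean / image_mean if image_mean > 0 else s)
    return adjusted


def prune_with_recycling(scores, image_ids, values, keep_ratio=0.2):
    """Keep the top tokens by subsidized score; summarize the rest per image."""
    adjusted = subsidize(scores, image_ids)
    k = max(1, int(len(scores) * keep_ratio))
    order = sorted(range(len(scores)), key=lambda i: adjusted[i], reverse=True)
    kept = sorted(order[:k])
    dropped = order[k:]
    # Assumed recycling bin: one averaged value vector per image for dropped tokens.
    bins = defaultdict(list)
    for i in dropped:
        bins[image_ids[i]].append(values[i])
    recycled = {
        img: [sum(col) / len(vecs) for col in zip(*vecs)]
        for img, vecs in bins.items()
    }
    return kept, recycled


# Toy input: the second image (id 1) receives far less raw attention.
scores = [0.4, 0.3, 0.2, 0.1, 0.05, 0.01, 0.01, 0.01]
image_ids = [0, 0, 0, 0, 1, 1, 1, 1]
values = [[float(i), 1.0] for i in range(8)]
kept, recycled = prune_with_recycling(scores, image_ids, values, keep_ratio=0.25)
print(kept, recycled)
```

On this toy input, selecting the top two tokens by raw score would keep only tokens from the first image, whereas the subsidized scores keep one token from each image. A real implementation would operate on per-head attention tensors and on both keys and values; that detail is omitted here.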

Related Material


@InProceedings{Zhuang_2026_CVPR,
    author    = {Zhuang, Jiedong and Lu, Lu and Dai, Ming and Chen, Jian and Liu, Qiang and Hu, Haoji},
    title     = {Myopia Rectification: KV Cache Pruning for MLLMs Via Dynamic Attention Subsidy and Token Reclamation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
    month     = {June},
    year      = {2026},
    pages     = {9023-9033}
}