Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal
Abstract
Vision-Language Models (VLMs) excel at visual understanding by leveraging pretrained image encoders to generate visual tokens. However, they struggle with high-resolution images and zoomed-in regions due to the computational burden and token redundancy of uniform patch-based processing, often leading to the loss of critical details. To address these challenges, we propose Token-Efficient Vision Language Model (TEVA), a novel framework that detects key regions and applies dynamic patch sampling to efficiently capture fine-grained details while preserving global context. Our approach first identifies subject-oriented regions using an adaptive detection strategy. Then, a dynamic patch sampling mechanism selects and arranges patches at varying scales, ensuring efficient processing without increasing token count. Extensive experiments demonstrate that TEVA significantly enhances VLM performance in handling visual details, seamlessly integrating with various decoders and LLMs.
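The abstract gives no implementation details, so the following is only a minimal illustrative sketch (in PyTorch) of the general idea it describes: propose a few salient regions, then assemble a fixed-size set of views, a downsampled global view plus full-resolution region crops, so the visual-token budget stays constant regardless of input resolution. All function names (propose_regions, sample_patches), the saliency-map stand-in for the adaptive region detector, and all resolutions are assumptions, not the paper's actual interface.

```python
import torch
import torch.nn.functional as F

def propose_regions(saliency: torch.Tensor, num_regions: int = 3, box: int = 224):
    """Hypothetical stand-in for an adaptive region detector: pick the top-k most
    salient pixels and return square boxes (y0, x0, h, w) centered on them."""
    H, W = saliency.shape
    idx = saliency.flatten().topk(num_regions).indices
    regions = []
    for i in idx.tolist():
        cy, cx = divmod(i, W)
        y0 = max(0, min(H - box, cy - box // 2))
        x0 = max(0, min(W - box, cx - box // 2))
        regions.append((y0, x0, box, box))  # crops may overlap; fine for a sketch
    return regions

def sample_patches(image: torch.Tensor, regions, view_res: int = 336):
    """Build one low-resolution global view plus one resampled crop per region.
    The number of views (and hence visual tokens after encoding) depends only on
    the number of proposed regions, not on the input resolution."""
    global_view = F.interpolate(image[None], size=(view_res, view_res),
                                mode="bilinear", align_corners=False)[0]
    views = [global_view]
    for (y0, x0, h, w) in regions:
        crop = image[:, y0:y0 + h, x0:x0 + w]
        views.append(F.interpolate(crop[None], size=(view_res, view_res),
                                   mode="bilinear", align_corners=False)[0])
    return torch.stack(views)  # (1 + num_regions, C, view_res, view_res)

# Example: a 4K input yields the same number of views as a low-resolution one.
image = torch.rand(3, 2160, 3840)
saliency = torch.rand(2160, 3840)   # placeholder saliency map, not a learned detector
views = sample_patches(image, propose_regions(saliency))
print(views.shape)                  # torch.Size([4, 3, 336, 336])
```

Each view would then be passed through the pretrained image encoder as usual; the point of the sketch is only that the token count is fixed by the region budget rather than by the pixel count of the input image.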
Related Material

[pdf] [supp] [bibtex]

@InProceedings{Jiang_2025_ICCV,
    author    = {Jiang, Yitong and Gu, Jinwei and Xue, Tianfan and Cheung, Ka Chun and Molchanov, Pavlo and Yin, Hongxu and Liu, Sifei},
    title     = {Token-Efficient VLM: High-Resolution Image Understanding via Dynamic Region Proposal},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {24147-24158}
}