Rethinking Token Reduction with Parameter-Efficient Fine-Tuning in ViT for Pixel-Level Tasks

Cheng Lei, Ao Li, Hu Yao, Ce Zhu, Le Zhang; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 14954-14964

Abstract


Parameter-efficient fine-tuning (PEFT) adapts pre-trained models to new tasks by updating only a small subset of parameters, achieving training efficiency but still incurring significant inference costs driven by input token length. This challenge is even more pronounced in pixel-level tasks, which require longer input sequences than image-level tasks. Although token reduction (TR) techniques can reduce computational demands, they often induce homogeneous attention patterns that compromise performance in pixel-level scenarios. This study underscores the importance of maintaining attention diversity for these tasks and proposes to enhance attention diversity while preserving the completeness of token sequences. Our approach effectively reduces the number of tokens processed within transformer blocks, improving computational efficiency without sacrificing performance on several pixel-level tasks. We also demonstrate the superior generalization capability of our proposed method compared to strong baselines. The source code will be made available at https://github.com/AVC2-UESTC/DAR-TR-PEFT.
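For readers unfamiliar with token reduction, the sketch below illustrates one common generic TR strategy: pruning patch tokens ranked by their attention weight from the [CLS] token. This is only a minimal illustration of the general technique the abstract discusses, not the paper's DAR-TR-PEFT method; the function name, shapes, and scoring rule are illustrative assumptions.

```python
import numpy as np

def reduce_tokens(tokens: np.ndarray, cls_attn: np.ndarray, keep: int) -> np.ndarray:
    """Keep the `keep` patch tokens most attended by the [CLS] token.

    tokens:   (N, D) patch tokens, [CLS] excluded
    cls_attn: (N,) attention weights from [CLS] to each patch token
    keep:     number of tokens to retain
    (Hypothetical helper for illustration; not from the paper.)
    """
    idx = np.argsort(cls_attn)[::-1][:keep]  # indices of the most-attended tokens
    idx = np.sort(idx)                       # restore original spatial order
    return tokens[idx]

# Toy example: 6 tokens of dimension 4, retain the 3 most attended.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((6, 4))
cls_attn = np.array([0.05, 0.30, 0.10, 0.25, 0.20, 0.10])
reduced = reduce_tokens(tokens, cls_attn, keep=3)
print(reduced.shape)  # (3, 4)
```

Subsequent transformer blocks then operate on the shorter sequence, which is where the quadratic attention cost savings come from; the abstract's point is that such pruning can homogenize attention, which is harmful for dense, pixel-level predictions.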

Related Material


[bibtex]
@InProceedings{Lei_2025_CVPR,
    author    = {Lei, Cheng and Li, Ao and Yao, Hu and Zhu, Ce and Zhang, Le},
    title     = {Rethinking Token Reduction with Parameter-Efficient Fine-Tuning in ViT for Pixel-Level Tasks},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {14954-14964}
}