Patch Ranking: Token Pruning as Ranking Prediction for Efficient CLIP

Cheng-En Wu, Jinhong Lin, Yu Hen Hu, Pedro Morgado; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 5842-5851

Abstract


Contrastive image-text pre-trained models such as CLIP have shown remarkable adaptability to downstream tasks. However, they face challenges due to the high computational requirements of the Vision Transformer (ViT) backbone. Current strategies to boost ViT efficiency focus on pruning patch tokens, but they fall short in addressing the multimodal nature of CLIP and in identifying the optimal subset of tokens for maximum performance. To address this, we propose greedy search methods to establish a "Golden Ranking" and introduce a lightweight predictor specifically trained to approximate this ranking. To compensate for any performance degradation resulting from token pruning, we incorporate learnable visual tokens that aid in restoring and potentially enhancing the model's performance. Our work presents a comprehensive and systematic investigation of token pruning within the ViT backbone of CLIP models. With our framework, we remove 40% of the patch tokens in CLIP's ViT while incurring an average accuracy loss of only 0.3% across seven datasets. Our study lays the groundwork for building more computationally efficient multimodal models without sacrificing their performance, addressing a key challenge in the application of advanced vision-language models.
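The abstract does not spell out the greedy search, so the following is only a minimal sketch of one way a ranking over patch tokens could be built by greedy removal and then used to keep a fixed fraction of tokens. Everything here is an assumption made for illustration: the names `greedy_ranking`, `prune`, and `toy_score`, the per-token weights, and the scoring rule itself, which stands in for whatever task metric the actual method evaluates. It does not reproduce the authors' algorithm, their predictor, or the learnable visual tokens.

```python
from typing import Callable, List, Sequence


def greedy_ranking(num_tokens: int,
                   score_fn: Callable[[Sequence[int]], float]) -> List[int]:
    """Rank tokens by greedy removal (hypothetical helper).

    At each step, drop the token whose removal leaves the highest score.
    Tokens removed last are treated as most important, so the returned
    list is ordered from most to least important.
    """
    kept = list(range(num_tokens))
    removed_order: List[int] = []
    while len(kept) > 1:
        # Try removing each remaining token; pick the least harmful removal.
        victim = max(kept, key=lambda t: score_fn([k for k in kept if k != t]))
        kept.remove(victim)
        removed_order.append(victim)
    removed_order.extend(kept)      # the single survivor ranks first
    return removed_order[::-1]


def prune(ranking: Sequence[int], keep_ratio: float) -> List[int]:
    """Keep the top fraction of the ranking, restored to positional order."""
    n_keep = max(1, round(len(ranking) * keep_ratio))
    return sorted(ranking[:n_keep])


# Toy stand-in for a task metric: each token has a fixed, made-up weight,
# and a subset's score is the sum of the weights of the tokens it keeps.
WEIGHTS = [0.1, 0.9, 0.3, 0.7, 0.5]


def toy_score(kept_tokens: Sequence[int]) -> float:
    return sum(WEIGHTS[t] for t in kept_tokens)


ranking = greedy_ranking(len(WEIGHTS), toy_score)
kept = prune(ranking, keep_ratio=0.6)   # i.e. remove 40% of the tokens
print(ranking)  # [1, 3, 4, 2, 0]
print(kept)     # [1, 3, 4]
```

Each greedy step evaluates the score once per remaining token, so a full ranking costs on the order of N^2 score evaluations for N tokens. That cost is one plausible reason to train a lightweight predictor to approximate the ranking rather than run the search at inference time.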

Related Material


[bibtex]
@InProceedings{Wu_2025_WACV,
    author    = {Wu, Cheng-En and Lin, Jinhong and Hu, Yu Hen and Morgado, Pedro},
    title     = {Patch Ranking: Token Pruning as Ranking Prediction for Efficient CLIP},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {5842-5851}
}