Efficient Transformer Adaptation with Soft Token Merging

Xin Yuan, Hongliang Fei, Jinoo Baek; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 3658-3668

Abstract


We develop an approach for efficiently adapting transformer layers, driven by the twin goals of optimization stability and broad applicability. Unlike existing methods, which rely on either simple heuristics or inefficient discrete optimization for token sampling, we craft a lightweight soft token merging scheme that preserves end-to-end differentiability while retaining strong task performance. To compensate for the potential information loss from merging, we design a novel token inflation module that maximizes functionality preservation across transformer blocks. Experiments on vision-only, language-only, and vision-language tasks show that our method achieves accuracy comparable to strong baselines while substantially reducing computation for both training and inference, and that these savings translate into real wall-clock speedups.
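The abstract only outlines the mechanism; the paper itself gives the exact formulation. As an illustration of what a differentiable merge-then-inflate pair can look like, the PyTorch sketch below assumes the soft merge is a softmax-normalized learned assignment over input tokens, and that inflation re-expands tokens by reusing the transpose of that assignment. All module and parameter names (`SoftTokenMerge`, `TokenInflate`, `num_merged`) are hypothetical, not the authors' API.

```python
# Minimal sketch of differentiable token merging and re-inflation.
# Assumption: merging is weighted pooling through a learned soft
# assignment matrix; this is NOT the paper's verified implementation.
import torch
import torch.nn as nn


class SoftTokenMerge(nn.Module):
    """Differentiably merges n input tokens into m < n output tokens.

    A learned scoring head produces per-token logits over m merged
    slots; softmax over the token axis yields a soft assignment, so
    each merged token is a weighted average of all inputs and
    gradients flow end to end (no discrete sampling).
    """

    def __init__(self, dim: int, num_merged: int):
        super().__init__()
        self.score = nn.Linear(dim, num_merged)  # per-token slot logits

    def forward(self, x: torch.Tensor):
        # x: (batch, n, dim)
        logits = self.score(x)               # (batch, n, m)
        assign = logits.softmax(dim=1)       # normalize over input tokens
        merged = assign.transpose(1, 2) @ x  # (batch, m, dim)
        return merged, assign


class TokenInflate(nn.Module):
    """Re-expands merged tokens to the original token count by reusing
    the soft assignment, approximately inverting the merge so later
    blocks still see a full-length sequence."""

    def forward(self, merged: torch.Tensor, assign: torch.Tensor):
        # merged: (batch, m, dim); assign: (batch, n, m)
        return assign @ merged               # (batch, n, dim)


if __name__ == "__main__":
    x = torch.randn(2, 196, 768)             # e.g. ViT patch tokens
    merge = SoftTokenMerge(dim=768, num_merged=49)
    inflate = TokenInflate()
    m, a = merge(x)                           # (2, 49, 768): cheaper attention
    x_back = inflate(m, a)                    # (2, 196, 768): restored length
```

In this reading, the compute saving comes from running attention on the m merged tokens instead of all n, while the inflation step keeps the block's input/output shapes unchanged so it can be dropped into different transformer blocks.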

Related Material


BibTeX

@InProceedings{Yuan_2024_CVPR,
  author    = {Yuan, Xin and Fei, Hongliang and Baek, Jinoo},
  title     = {Efficient Transformer Adaptation with Soft Token Merging},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {3658-3668}
}