@InProceedings{Huo_2026_CVPR,
    author    = {Huo, Simin and Li, Ning},
    title     = {MaMe: Matrix-Based Token Merging},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
    month     = {June},
    year      = {2026},
    pages     = {2863-2872}
}
MaMe: Matrix-Based Token Merging
Abstract
We introduce MaMe, a training-free, differentiable token merging method that relies entirely on matrix operations to accelerate vision transformers. When applied to pre-trained models, MaMe doubles ViT-B@224 throughput with a 2% drop in accuracy. When training from scratch, a ViT-T model with MaMe achieves 1.94x throughput with a 1.3% accuracy drop. As a downsampling layer in Iwin models, MaMe reduces Iwin-S's GFLOPs from 9.0 to 1.8 with a 12.4% accuracy drop. In SigLIP2-B@512 zero-shot classification, MaMe provides 1.3x acceleration with negligible performance degradation (78.02 vs. 78.37). For multimodal reasoning, MaMe accelerates LLaVA-v1.5-7B inference by 36% on MME with minimal degradation (31.40 vs. 32.76). In video tasks, MaMe accelerates VideoMAE-L by 48.5% on Kinetics-400 with a 0.84% accuracy loss. On the COCO Caption task, MaMe improves both performance and speed, raising CIDEr from the baseline's 0.71 to 2.71 with a 16% speedup. Collectively, these results demonstrate MaMe's effectiveness in accelerating transformer-based vision models.
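The abstract does not spell out the merging procedure, so the following is only a generic illustration of what differentiable token merging built from dense matrix operations can look like. It is not the MaMe algorithm. The function name `soft_merge_tokens`, the evenly spaced choice of target tokens, the cosine-similarity scoring, and the `temperature` parameter are all assumptions made for this sketch.

```python
import numpy as np


def soft_merge_tokens(x, num_out, temperature=0.1):
    """Merge N tokens into num_out tokens with dense matrix operations.

    Illustrative sketch only (NOT the paper's method).
    x: array of shape (N, D). Returns (merged, merge) with shapes
    (num_out, D) and (num_out, N).
    """
    n, _ = x.shape

    # Assumption: evenly spaced tokens act as merge targets.
    idx = np.linspace(0, n - 1, num_out).round().astype(int)

    # Cosine similarity between each target and every token.
    x_norm = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-8)
    sim = x_norm[idx] @ x_norm.T  # (num_out, N)

    # Soft assignment: each source token spreads its weight over the
    # targets via a softmax along the target axis (columns sum to 1).
    logits = sim / temperature
    logits = logits - logits.max(axis=0, keepdims=True)
    assign = np.exp(logits)
    assign = assign / assign.sum(axis=0, keepdims=True)

    # Normalise rows so each merged token is a weighted average.
    merge = assign / assign.sum(axis=1, keepdims=True)

    return merge @ x, merge


rng = np.random.default_rng(0)
tokens = rng.standard_normal((16, 8))      # 16 tokens, 8 channels
merged, merge = soft_merge_tokens(tokens, num_out=8)
print(merged.shape, merge.shape)
```

Because every step is a matrix product, a softmax, or a normalisation, gradients can flow through the merge, and no parameters are learned. Halving the token count in this way cuts the cost of the attention score computation, which is quadratic in the number of tokens, to roughly a quarter.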