ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers

Narges Norouzi, Svetlana Orlova, Daan de Geus, Gijs Dubbelman; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 15773-15782

Abstract


This work presents Adaptive Local-then-Global Merging (ALGM), a token reduction method for semantic segmentation networks that use plain Vision Transformers. ALGM merges tokens in two stages: (1) in the first network layer, it merges similar tokens within a small local window, and (2) halfway through the network, it merges similar tokens across the entire image. This is motivated by an analysis in which we found that, in those situations, tokens with a high cosine similarity can likely be merged without a drop in segmentation quality. With extensive experiments across multiple datasets and network configurations, we show that ALGM not only significantly improves the throughput by up to 100%, but can also enhance the mean IoU by up to +1.1, thereby achieving a better trade-off between segmentation quality and efficiency than existing methods. Moreover, our approach is adaptive during inference, meaning that the same model can be used for optimal efficiency or accuracy, depending on the application. Code is available at https://tue-mps.github.io/ALGM.
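To make the two-stage idea in the abstract concrete, below is a minimal PyTorch sketch of similarity-based token merging: tokens whose cosine similarity to a group prototype exceeds a threshold are averaged into one token, first within small local windows and then globally over the remaining tokens. The function name `merge_by_similarity`, the 2x2 window size, and the threshold `tau` are illustrative assumptions for this sketch, not the authors' implementation; see the official code at the link above for the actual method.

```python
import torch
import torch.nn.functional as F


def merge_by_similarity(tokens: torch.Tensor, groups: torch.Tensor, tau: float = 0.9):
    """Merge tokens within each group whose cosine similarity to the group
    mean exceeds `tau`; dissimilar tokens are kept unmerged.

    tokens: (N, C) tokens of a single image.
    groups: (N,) integer group id per token (a local window id for stage 1,
            a single global id for stage 2).
    Returns a (M, C) tensor with M <= N.
    """
    out = []
    for g in groups.unique():
        members = tokens[groups == g]                       # (n_g, C)
        center = members.mean(dim=0, keepdim=True)          # group prototype
        sim = F.cosine_similarity(members, center, dim=-1)  # (n_g,)
        similar, dissimilar = members[sim >= tau], members[sim < tau]
        if similar.numel() > 0:
            out.append(similar.mean(dim=0, keepdim=True))   # merge into one token
        out.append(dissimilar)                               # keep the rest as-is
    return torch.cat(out, dim=0)


# Stage 1 (first layer): group ids are small local windows on the token grid.
h = w = 8                                                    # 8x8 token grid (hypothetical)
win = 2                                                      # hypothetical 2x2 windows
ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
local_ids = (ys // win) * (w // win) + (xs // win)           # window id per token
tokens = torch.randn(h * w, 192)
stage1 = merge_by_similarity(tokens, local_ids.flatten(), tau=0.9)

# Stage 2 (halfway through the network): one global group over all tokens.
stage2 = merge_by_similarity(stage1, torch.zeros(stage1.shape[0], dtype=torch.long), tau=0.9)
print(tokens.shape, stage1.shape, stage2.shape)
```

Because the number of merged tokens depends on how similar the tokens of a given image are, the threshold can be adjusted at inference time to trade throughput against accuracy without retraining, which is the adaptivity property the abstract refers to.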

Related Material


[bibtex]
@InProceedings{Norouzi_2024_CVPR,
    author    = {Norouzi, Narges and Orlova, Svetlana and de Geus, Daan and Dubbelman, Gijs},
    title     = {ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {15773-15782}
}