Token Pooling in Vision Transformers for Image Classification

Dmitrii Marin, Jen-Hao Rick Chang, Anurag Ranjan, Anish Prabhu, Mohammad Rastegari, Oncel Tuzel; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 12-21


Pooling is commonly used to improve the computation-accuracy trade-off of convolutional networks. By aggregating neighboring feature values on the image grid, pooling layers downsample feature maps while maintaining accuracy. In transformers, however, tokens are processed individually and do not necessarily lie on regular grids. Utilizing pooling methods designed for image grids (e.g., average pooling) can thus be sub-optimal for transformers, as shown by our experiments. In this paper, we propose Token Pooling to downsample tokens in vision transformers. We take a new perspective --- instead of assuming tokens form a regular grid, we treat them as discrete (and irregular) samples of a continuous signal. Given a target number of tokens, Token Pooling finds the set of tokens that best approximates the underlying continuous signal. We rigorously evaluate the proposed method on the standard transformer architecture (ViT/DeiT), and our experiments show that Token Pooling significantly improves the computation-accuracy trade-off without any further modifications to the architecture. On ImageNet-1k, Token Pooling enables DeiT-Ti to achieve the same top-1 accuracy while using 42% fewer computations.

Related Material

[pdf] [supp]
@InProceedings{Marin_2023_WACV, author = {Marin, Dmitrii and Chang, Jen-Hao Rick and Ranjan, Anurag and Prabhu, Anish and Rastegari, Mohammad and Tuzel, Oncel}, title = {Token Pooling in Vision Transformers for Image Classification}, booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}, month = {January}, year = {2023}, pages = {12-21} }