Swin on Axes: Extending Swin Transformers to Quadtree Image Representations

Marc Oliu, Kamal Nasrollahi, Sergio Escalera, Thomas B. Moeslund; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2024, pp. 193-201

Abstract


In recent years, Transformer models have revolutionized machine learning. While this has produced impressive results in Natural Language Processing, Computer Vision quickly ran into computation and memory problems due to the high resolution and dimensionality of the input data. This is particularly true for video, where the number of tokens grows cubically with the frame and temporal resolutions. A first approach to this problem was the Vision Transformer, which partitions the input into embedded grid cells, lowering the effective resolution. More recently, Swin Transformers introduced a hierarchical scheme that brings the concepts of pooling and locality to Transformers at a much lower computational and memory cost. This work proposes a reformulation of the latter that views Swin Transformers as regular Transformers applied over a quadtree representation of the input, intrinsically providing a wider range of design choices for the attention mechanism. Compared to similar approaches such as Swin and MaxViT, our method works across the full range of scales with a single attention mechanism, simultaneously taking into account both dense short-range and sparse long-range dependencies with low computational overhead and without introducing additional sequential operations, thus making full use of GPU parallelism.
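
One way to make the quadtree view concrete is to reorder the flattened patch sequence along a Z-order (Morton) curve, so that every 2x2 neighbourhood of patches (and, recursively, every 4x4, 8x8, ... neighbourhood) becomes a contiguous chunk of the sequence; a regular Transformer attending over fixed-size contiguous chunks of that sequence then behaves like windowed attention at the corresponding quadtree level. The sketch below illustrates this idea only; the function name, shapes, and NumPy implementation are assumptions for illustration, not the paper's actual code.

import numpy as np

def morton_order(h, w):
    """Quadtree (Z-order) permutation for an h x w grid of patch tokens.

    Interleaving the bits of each patch's row and column coordinates
    yields a Morton code; sorting the raster-ordered tokens by this code
    places every 2x2 (and, recursively, larger power-of-two) block of
    patches contiguously in the flattened sequence.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    code = np.zeros_like(ys)
    for bit in range(max(h, w).bit_length()):
        code |= ((ys >> bit) & 1) << (2 * bit + 1)  # row bits -> odd positions
        code |= ((xs >> bit) & 1) << (2 * bit)      # col bits -> even positions
    # Permutation mapping quadtree order -> raster-order indices.
    return code.ravel().argsort()

# Example: reorder 8x8 = 64 patch embeddings so that each consecutive
# group of 4 tokens is one 2x2 quadtree cell.
perm = morton_order(8, 8)
tokens = np.random.randn(64, 96)               # (num_patches, embed_dim)
tokens_quadtree = tokens[perm]                 # quadtree-ordered sequence
windows = tokens_quadtree.reshape(16, 4, 96)   # 16 windows of 2x2 patches

Grouping the same reordered sequence into chunks of 16 or 64 tokens instead selects the coarser quadtree levels, which is consistent with the abstract's claim that a single attention mechanism can mix dense short-range and sparse long-range interactions without extra sequential steps.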

Related Material


[pdf]
[bibtex]
@InProceedings{Oliu_2024_WACV,
    author    = {Oliu, Marc and Nasrollahi, Kamal and Escalera, Sergio and Moeslund, Thomas B.},
    title     = {Swin on Axes: Extending Swin Transformers to Quadtree Image Representations},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {January},
    year      = {2024},
    pages     = {193-201}
}