Token-Aligned Hierarchies for Lightweight Super-Resolution

Jinendra Malekar, Lingjia Shi, Peyton Chandarana, Ramtin Zand; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 3448-3457

Abstract


Windowed self-attention (WSA) has become a strong backbone for single-image super-resolution (SR), yet its computational overhead often results in high inference latency. We revisit Swin-style SR from a hierarchical perspective and introduce a token-aligned encoder-decoder built entirely with grouped and depthwise convolutions, replacing attention windows with efficient spatial mixing. Our architecture preserves the locality bias of WSA while substantially improving speed and stability. It incorporates (i) symmetry-preserving padding for consistent token partitioning, (ii) a token pyramid that expands channels through patch merging to aggregate broader context, and (iii) Token-Aligned Skip Fusion (TASF) for precise multi-scale feature reuse. Built upon the SwinIR hierarchy, our model attains both relatively high reconstruction quality (PSNR 37.8 dB for ×2) and the lowest latency among all compared methods, running faster than SwinIR-light while maintaining strong texture consistency and low memory usage. These results demonstrate that hierarchical, convolution-based modeling can match or surpass transformer performance at a fraction of the cost, making our design well suited to real-time and edge SR applications.
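
As a rough illustration of components (i) and (ii), the sketch below pads a feature map evenly on both sides so that it divides into whole tokens, then applies the 2×2 space-to-depth step of Swin-style patch merging, which halves the resolution and expands the channels. It is not the authors' implementation: the function names `pad_symmetric` and `patch_merge`, the reflect padding mode, the multiple of 4, and the channel ordering are all assumptions made for illustration, and the learned projection that Swin-style models typically apply after merging is omitted.

```python
import numpy as np


def pad_symmetric(x, multiple):
    """Pad an (H, W, C) array so H and W become multiples of `multiple`.

    Hypothetical helper: the padding is split as evenly as possible between
    the two sides of each axis. Reflect mode is an assumption.
    """
    h, w, _ = x.shape
    pad_h = (-h) % multiple
    pad_w = (-w) % multiple
    top, left = pad_h // 2, pad_w // 2
    pads = ((top, pad_h - top), (left, pad_w - left), (0, 0))
    return np.pad(x, pads, mode="reflect"), pads


def patch_merge(x):
    """2x2 space-to-depth merging: (H, W, C) -> (H/2, W/2, 4C).

    Each output token concatenates the channels of a 2x2 block of input
    tokens in row-major order. No learned projection is applied here.
    """
    h, w, c = x.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even; pad first")
    x = x.reshape(h // 2, 2, w // 2, 2, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (H/2, W/2, dy, dx, C)
    return x.reshape(h // 2, w // 2, 4 * c)


# A small 5x7 feature map with 3 channels, padded to 8x8 and merged once.
feat = np.arange(5 * 7 * 3, dtype=np.float32).reshape(5, 7, 3)
padded, pads = pad_symmetric(feat, multiple=4)
merged = patch_merge(padded)
print(padded.shape, merged.shape)
```

Because the padding offsets are returned alongside the padded array, a decoder can crop the same margins after upsampling, which is one plausible way to keep encoder and decoder tokens aligned for skip connections.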

Related Material


[bibtex]
@InProceedings{Malekar_2026_CVPR,
    author    = {Malekar, Jinendra and Shi, Lingjia and Chandarana, Peyton and Zand, Ramtin},
    title     = {Token-Aligned Hierarchies for Lightweight Super-Resolution},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {3448-3457}
}