[bibtex]
@InProceedings{Walton_2025_CVPR,
    author    = {Walton, Steven and Hassani, Ali and Xu, Xingqian and Wang, Zhangyang and Shi, Humphrey},
    title     = {Efficient Image Generation with Variadic Attention Heads},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {3239-3250}
}
Efficient Image Generation with Variadic Attention Heads
Abstract
While the integration of transformers into vision models has yielded significant improvements on vision tasks, these models still require substantial computation for both training and inference. Restricted attention mechanisms greatly reduce this computational burden, but at the cost of losing either global or local coherence. We propose a simple yet powerful method to reduce these trade-offs: allow the attention heads of a single transformer to attend to multiple receptive fields. We demonstrate our method using Neighborhood Attention (NA) and integrate it into a StyleGAN-based architecture for image generation. With this work, dubbed StyleNAT, we achieve an FID of 2.05 on FFHQ, a 6% improvement over StyleGAN-XL, while using 28% fewer parameters and providing 4x the throughput. StyleNAT establishes the Pareto frontier on FFHQ-256 and demonstrates powerful and efficient image generation on other datasets. Our code and model checkpoints are publicly available at: https://github.com/SHI-Labs/StyleNAT
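The core idea, heads within a single attention layer operating over different receptive fields, can be sketched in a few lines of PyTorch. The snippet below is an illustrative toy, not the paper's implementation: it emulates neighborhood attention with dense masked attention (the released code relies on the efficient NATTEN kernels instead), its border handling is simplified relative to true NA, and the `VariadicHeadAttention` module, its `fields` parameter, and `neighborhood_mask` are hypothetical names introduced here for illustration.

```python
import torch
from torch import nn

def neighborhood_mask(h, w, kernel, dilation=1, device=None):
    """Boolean (h*w, h*w) mask: True where a key pixel lies inside the dilated
    kernel x kernel window around the query pixel (Chebyshev distance; the
    border clamping of true neighborhood attention is simplified here)."""
    ys, xs = torch.meshgrid(torch.arange(h, device=device),
                            torch.arange(w, device=device), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)   # (h*w, 2)
    diff = (coords[:, None, :] - coords[None, :, :]).abs()       # (h*w, h*w, 2)
    radius = dilation * (kernel // 2)
    on_lattice = (diff % dilation == 0).all(-1)   # respect the dilation stride
    in_window = (diff <= radius).all(-1)          # stay inside the window
    return on_lattice & in_window

class VariadicHeadAttention(nn.Module):
    """Toy multi-head attention where each head group gets its own receptive
    field (kernel size / dilation), in the spirit of StyleNAT's variadic
    heads. Masked dense attention stands in for the efficient NA kernels,
    so this is O(n^2) and illustrative only."""
    def __init__(self, dim, num_heads, fields):
        # fields: list of (kernel, dilation) pairs, one per head group;
        # heads are split evenly across the groups.
        super().__init__()
        assert num_heads % len(fields) == 0 and dim % num_heads == 0
        self.num_heads, self.fields = num_heads, fields
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, h, w):
        b, n, dim = x.shape  # n == h * w (flattened feature map)
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, n, head_dim)
        heads_per_group = self.num_heads // len(self.fields)
        outs = []
        for g, (kernel, dilation) in enumerate(self.fields):
            s = slice(g * heads_per_group, (g + 1) * heads_per_group)
            mask = neighborhood_mask(h, w, kernel, dilation, x.device)
            attn = (q[:, s] @ k[:, s].transpose(-2, -1)) * self.head_dim ** -0.5
            attn = attn.masked_fill(~mask, float("-inf")).softmax(-1)
            outs.append(attn @ v[:, s])
        out = torch.cat(outs, dim=1).transpose(1, 2).reshape(b, n, dim)
        return self.proj(out)

# Example: 4 heads, half with a local 3x3 field, half with a dilated 3x3 field.
attn = VariadicHeadAttention(dim=64, num_heads=4, fields=[(3, 1), (3, 2)])
x = torch.randn(2, 8 * 8, 64)
print(attn(x, h=8, w=8).shape)  # torch.Size([2, 64, 64])
```

Because each head group only differs in its attention mask, the mixed receptive fields add no parameters over standard multi-head attention; an efficient implementation would compute each group with restricted-attention kernels rather than masking a dense attention matrix.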