Scaling Parallel Sequence Models to Vision Foundation Models

Jiang, Yitong; McCarthy, Collin; Wang, Hongjun; Ye, Hanrong; Dou, Qi; Xue, Tianfan; Gu, Jinwei; Kautz, Jan; Yin, Hongxu; Molchanov, Pavlo; Liu, Sifei

Yitong Jiang, Collin McCarthy, Hongjun Wang, Hanrong Ye, Qi Dou, Tianfan Xue, Jinwei Gu, Jan Kautz, Hongxu Yin, Pavlo Molchanov, Sifei Liu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 41332-41341

Abstract

Scaling vision foundation models is constrained by the quadratic complexity of self-attention. Although subquadratic alternatives such as linear-attention variants and state-space models reduce this complexity, they typically serialize images into 1D token sequences, compromising spatial coherence and efficiency. Generalized Spatial Propagation Networks (GSPN) offer a linear-time alternative that propagates context directly on the 2D grid via line-scan propagation and removes positional embeddings, yet the original design hits GPU-scaling limits: as batch size and channel count grow, they saturate SM concurrency, which serializes scans and spikes latency. We introduce Compact GSPN (C-GSPN), a ViT block that compresses the propagation space to preserve accuracy while cutting propagation latency by nearly 10x. We further improve efficiency with lightweight projections and fused CUDA kernels. To enable large-scale pretraining, we adopt a two-stage cross-operator distillation strategy that combines layer-wise supervision with end-to-end alignment. In a representative 1K configuration (batch 32, C=1152), C-GSPN achieves up to 2x speedup, maintains competitive zero-shot accuracy, and improves segmentation by +2.1%. Extensive experiments and ablations show that the proposed compression and two-stage distillation are critical for strong transfer while substantially reducing compute, enabling the first extension of a subquadratic operator to foundation-scale (CLIP-style) vision pretraining.
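To illustrate what "line-scan propagation on the 2D grid" can look like, the following is a minimal pure-Python sketch of a top-to-bottom scan. It is not the authors' implementation. The function name `line_scan`, the three-neighbour connectivity to the previous row, the per-pixel gate `lam`, and the uniform weights in the example are all assumptions made for illustration. The point is only that each pixel is visited once and does a constant amount of work, so the cost grows linearly with the number of pixels.

```python
def line_scan(x, w, lam):
    """Top-to-bottom line scan over an H x W grid (hypothetical sketch).

    x   : H x W input values (list of lists of floats)
    w   : H x W x 3 weights linking each pixel to its upper-left, upper,
          and upper-right neighbours in the previous row (an assumption)
    lam : H x W gates applied to the input at each pixel (an assumption)
    """
    H, W = len(x), len(x[0])
    h = [[0.0] * W for _ in range(H)]
    for i in range(H):              # rows are processed sequentially
        for j in range(W):          # pixels in a row are independent of each other
            acc = lam[i][j] * x[i][j]
            if i > 0:
                for k, dj in enumerate((-1, 0, 1)):
                    jj = j + dj
                    if 0 <= jj < W:  # neighbours outside the grid are skipped
                        acc += w[i][j][k] * h[i - 1][jj]
            h[i][j] = acc
    return h


# Toy example: a single active pixel in the top row spreads downward.
H, W = 3, 3
x = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w = [[[1.0 / 3.0] * 3 for _ in range(W)] for _ in range(H)]
lam = [[1.0] * W for _ in range(H)]
h = line_scan(x, w, lam)
```

Because the inner loop over a row has no dependencies between pixels, a GPU version could process each row in parallel and only the loop over rows would remain sequential. A full 2D operator would typically combine scans from several directions; this sketch shows only one.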

Related Material

@InProceedings{Jiang_2026_CVPR,
    author    = {Jiang, Yitong and McCarthy, Collin and Wang, Hongjun and Ye, Hanrong and Dou, Qi and Xue, Tianfan and Gu, Jinwei and Kautz, Jan and Yin, Hongxu and Molchanov, Pavlo and Liu, Sifei},
    title     = {Scaling Parallel Sequence Models to Vision Foundation Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {41332-41341}
}