SRA 2: Variational Autoencoder Self-Representation Alignment  for Efficient Diffusion Training

Wang, Mengmeng; Jiang, Dengyang; Li, Liuzhuozheng; Lin, Yucheng; Shen, Guojiang; Kong, Xiangjie; Liu, Yong; Dai, Guang; Wang, Jingdong

SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training

Mengmeng Wang, Dengyang Jiang, Liuzhuozheng Li, Yucheng Lin, Guojiang Shen, Xiangjie Kong, Yong Liu, Guang Dai, Jingdong Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 32978-32987

Abstract

Denoising-based diffusion transformers, despite their strong generation performance, suffer from inefficient training convergence. Existing methods addressing this issue, such as REPA (relying on external representation encoders) or SRA (requiring dual-model setups), inevitably incur heavy computational overhead during training due to external dependencies. To tackle these challenges, this paper proposes SRA 2, a lightweight intrinsic self-representation alignment framework for efficient diffusion training. SRA 2 leverages off-the-shelf pre-trained Variational Autoencoder (VAE) features: their reconstruction property ensures inherent encoding of visual priors like rich texture details, structural patterns, and basic semantic information. Specifically, SRA 2 aligns the intermediate latent features of diffusion transformers with VAE features via a lightweight projection layer, supervised by a feature alignment loss. This design accelerates training without extra representation encoders or dual-model maintenance, resulting in a simple yet effective pipeline. Extensive experiments demonstrate that SRA 2 improves both generation quality and training convergence speed compared to vanilla diffusion transformers, matches or outperforms state-of-the-art acceleration methods, and incurs merely 4% extra GFLOPs with zero additional cost for external guidance models.

Related Material

[pdf] [supp] [arXiv]

[bibtex]

@InProceedings{Wang_2026_CVPR, author = {Wang, Mengmeng and Jiang, Dengyang and Li, Liuzhuozheng and Lin, Yucheng and Shen, Guojiang and Kong, Xiangjie and Liu, Yong and Dai, Guang and Wang, Jingdong}, title = {SRA 2: Variational Autoencoder Self-Representation Alignment for Efficient Diffusion Training}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2026}, pages = {32978-32987} }