VOSR: A Vision-Only Generative Model for Image Super-Resolution

Wu, Rongyuan; Sun, Lingchen; Zhang, Zhengqiang; Kong, Xiangtao; Zhao, Jixin; Wang, Shihao; Zhang, Lei

Rongyuan Wu, Lingchen Sun, Zhengqiang Zhang, Xiangtao Kong, Jixin Zhao, Shihao Wang, Lei Zhang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 16311-16321

Abstract

Large-scale pre-trained text-to-image (T2I) diffusion models, such as Stable Diffusion, can be fine-tuned for image super-resolution (SR) to produce highly realistic details. While the results are impressive, pre-training such multi-modal models demands billions of high-quality text-image pairs and substantial computational resources, even though SR is fundamentally an image-to-image (I2I) task. This raises a critical question: do we truly need multi-modal priors and billion-scale text-image data to solve a purely visual task? In this paper, we propose **VOSR**, a **V**ision-**O**nly **S**uper-**R**esolution framework that eliminates the need for textual priors and multi-modal pre-training. We identify two key limitations in previous image-based, uni-modal diffusion models: limited visual semantic guidance and unstable unconditional training. To address them, we leverage a pre-trained vision encoder to inject semantic cues, and introduce a relaxed unconditional objective that partially uses the low-quality condition to stabilize training. To accelerate inference, we adopt a modified shortcut model for one-step SR with minimal quality degradation. VOSR is trained from scratch with significantly less data and at a lower computational cost than T2I-based diffusion models. Nevertheless, VOSR achieves comparable or even better performance than state-of-the-art T2I-tuned SR methods on both synthetic and real-world benchmarks, demonstrating its potential as a scalable and competitive alternative for generative SR. Code and models will be made publicly available.

Related Material

@InProceedings{Wu_2026_CVPR,
    author    = {Wu, Rongyuan and Sun, Lingchen and Zhang, Zhengqiang and Kong, Xiangtao and Zhao, Jixin and Wang, Shihao and Zhang, Lei},
    title     = {VOSR: A Vision-Only Generative Model for Image Super-Resolution},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {16311-16321}
}