@InProceedings{Wu_2026_CVPR,
    author    = {Wu, Rongyuan and Sun, Lingchen and Zhang, Zhengqiang and Kong, Xiangtao and Zhao, Jixin and Wang, Shihao and Zhang, Lei},
    title     = {VOSR: A Vision-Only Generative Model for Image Super-Resolution},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {16311-16321}
}
VOSR: A Vision-Only Generative Model for Image Super-Resolution
Abstract
Large-scale pre-trained text-to-image (T2I) diffusion models, such as Stable Diffusion, can be fine-tuned for image super-resolution (SR) to produce highly realistic details. While the results are impressive, pre-training such multi-modal models demands billions of high-quality text-image pairs and substantial computational resources, even though SR is fundamentally an image-to-image (I2I) task. This raises a critical question: do we truly need multi-modal priors and billion-scale text-image data to solve a purely visual task? In this paper, we propose **VOSR**, a **V**ision-**O**nly **S**uper-**R**esolution framework that eliminates the need for textual priors and multi-modal pre-training. We identify two key limitations in previous image-based, uni-modal diffusion models: limited visual semantic guidance and unstable unconditional training. To address them, we leverage a pre-trained vision encoder to inject semantic cues, and introduce a relaxed unconditional objective that partially uses the low-quality condition to stabilize training. To accelerate inference, we adopt a modified shortcut model for one-step SR with minimal quality degradation. VOSR is trained from scratch with significantly less data and lower computational cost than T2I-based diffusion models. Nevertheless, VOSR achieves comparable or even better performance than state-of-the-art T2I-tuned SR methods on both synthetic and real-world benchmarks, demonstrating its potential as a scalable and competitive alternative for generative SR. Code and models will be made publicly available.