@InProceedings{Belagali_2026_CVPR,
    author    = {Belagali, Varun and Yellapragada, Srikar and Graikos, Alexandros and Kapse, Saarthak and Li, Zilinghan and Nandi, Tarak Nath and Madduri, Ravi K and Prasanna, Prateek and Saltz, Joel and Samaras, Dimitris},
    title     = {Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {2886-2896}
}
Gen-SIS: Generative Self-augmentation Improves Self-supervised Learning
Abstract
Self-supervised learning (SSL) methods have emerged as strong visual representation learners by training an image encoder to maximize similarity between features of different views of the same image. To perform this view-invariance task, current SSL algorithms rely on hand-crafted augmentations, such as random cropping and color jittering, to create multiple views of an image. Recently, generative diffusion models were shown to improve SSL by providing a wider range of data augmentations. However, these diffusion models usually require pre-training on large-scale image-text datasets, which might not be available for many specialized domains like histopathology. In this work, we introduce Gen-SIS, a diffusion-based augmentation technique trained exclusively on unlabeled image data, eliminating any reliance on external sources of supervision such as text captions. We first train a vanilla SSL encoder on a dataset using only hand-crafted augmentations. We then train a diffusion model conditioned on embeddings from that SSL encoder. Once trained, this diffusion model can synthesize diverse views of a source image when conditioned on its embedding. Leveraging the ability to interpolate in the encoder latent space, we introduce a novel pretext task: disentangling the two source images from an interpolated synthetic image. We show that these 'self-augmentations', i.e., generative augmentations based on the vanilla SSL encoder embeddings, paired with our disentanglement pretext task, facilitate the training of stronger SSL encoders. We validate Gen-SIS by demonstrating performance gains across various downstream tasks in natural images, which are generally object-centric, and digital histopathology images, which are typically context-based. Furthermore, we show Gen-SIS's effectiveness across multiple SSL methods and encoder variants, highlighting its broad applicability.
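The interpolation and disentanglement steps described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the interpolation scheme or the loss, so everything here is an assumption made for illustration: linear mixing followed by L2 renormalization, a cosine-distance objective, and the hypothetical names `interpolate_embeddings` and `disentanglement_loss`. The diffusion model that would turn the mixed embedding into a synthetic image is omitted entirely.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(u, v):
    """Cosine similarity between two vectors of equal length."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def interpolate_embeddings(z_a, z_b, alpha=0.5):
    """Mix two SSL embeddings (assumed scheme: linear mix, then renormalize).

    The result would serve as the conditioning vector for the diffusion
    model, which is not modeled in this sketch.
    """
    if len(z_a) != len(z_b):
        raise ValueError("embeddings must have the same dimension")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    mixed = [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]
    return l2_normalize(mixed)


def disentanglement_loss(pred_a, pred_b, z_a, z_b):
    """Hypothetical objective: features predicted from the interpolated
    image should each match one of the two source embeddings."""
    return (1.0 - cosine_similarity(pred_a, z_a)) + (
        1.0 - cosine_similarity(pred_b, z_b)
    )


# Toy embeddings standing in for the vanilla SSL encoder's outputs.
z_a = [1.0, 0.0]
z_b = [0.0, 1.0]
z_mix = interpolate_embeddings(z_a, z_b, alpha=0.5)
```

With these toy inputs, `z_mix` is equally similar to both source embeddings, and the loss is zero when the two predicted features coincide with the sources. In a real pipeline the predictions would come from the encoder being trained, applied to the image synthesized from `z_mix`.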

