SODA: Bottleneck Diffusion Models for Representation Learning

Drew A. Hudson, Daniel Zoran, Mateusz Malinowski, Andrew K. Lampinen, Andrew Jaegle, James L. McClelland, Loic Matthey, Felix Hill, Alexander Lerchner; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 23115-23127

Abstract


We introduce SODA, a self-supervised diffusion model designed for representation learning. The model incorporates an image encoder that distills a source view into a compact representation, which in turn guides the generation of related novel views. We show that by imposing a tight bottleneck between the encoder and a denoising decoder, and by leveraging novel view synthesis as a self-supervised objective, we can turn diffusion models into strong representation learners capable of capturing visual semantics in an unsupervised manner. To the best of our knowledge, SODA is the first diffusion model to succeed at ImageNet linear-probe classification, and, at the same time, it accomplishes reconstruction, editing, and synthesis tasks across a wide range of datasets. Further investigation reveals the disentangled nature of its emergent latent space, which serves as an effective interface to control and manipulate the produced images. All in all, we aim to shed light on the exciting and promising potential of diffusion models, not only for image generation, but also for learning rich and robust representations. See our website at soda-diffusion.github.io.
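The setup the abstract describes (an encoder compressing a source view into a tight bottleneck that conditions a denoising decoder trained on novel view synthesis) can be illustrated with a short sketch. The following minimal PyTorch code is not the authors' implementation: the module and function names (BottleneckEncoder, ConditionalDenoiser, soda_training_step), the FiLM-style conditioning, and the toy network sizes are all hypothetical stand-ins for the paper's actual encoder and denoising decoder, shown only to make the training signal concrete.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottleneckEncoder(nn.Module):
    """Distills a clean source view into a compact latent vector (the bottleneck)."""
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(64, latent_dim)

    def forward(self, x):
        return self.proj(self.backbone(x))

class ConditionalDenoiser(nn.Module):
    """Predicts the noise added to a target view, conditioned on the latent z."""
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.film = nn.Linear(latent_dim, 32 * 2)  # hypothetical FiLM-style modulation from z
        self.inp = nn.Conv2d(3, 32, 3, padding=1)
        self.out = nn.Conv2d(32, 3, 3, padding=1)

    def forward(self, x_noisy, z):
        h = F.relu(self.inp(x_noisy))
        scale, shift = self.film(z).chunk(2, dim=-1)
        h = h * (1 + scale[..., None, None]) + shift[..., None, None]
        return self.out(h)

def soda_training_step(encoder, denoiser, source, target, alphas_cumprod):
    """One training step: encode the source view, denoise a noised related target view."""
    z = encoder(source)                                # compact bottleneck representation
    t = torch.randint(0, len(alphas_cumprod), (source.size(0),))
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(target)
    x_noisy = a.sqrt() * target + (1 - a).sqrt() * noise  # forward diffusion of the target
    pred = denoiser(x_noisy, z)
    return F.mse_loss(pred, noise)                     # standard epsilon-prediction loss
```

The key design point from the abstract is the tightness of the bottleneck: because z is low-dimensional, the encoder cannot pass pixels through and must instead capture the semantics shared between the source and target views. The toy networks above mirror that constraint only in spirit.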

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Hudson_2024_CVPR,
  author    = {Hudson, Drew A. and Zoran, Daniel and Malinowski, Mateusz and Lampinen, Andrew K. and Jaegle, Andrew and McClelland, James L. and Matthey, Loic and Hill, Felix and Lerchner, Alexander},
  title     = {SODA: Bottleneck Diffusion Models for Representation Learning},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {23115-23127}
}