@InProceedings{Lu_2025_ICCV,
    author    = {Lu, Yanzuo and Ren, Yuxi and Xia, Xin and Lin, Shanchuan and Wang, Xing and Xiao, Xuefeng and Ma, Andy J. and Xie, Xiaohua and Lai, Jian-Huang},
    title     = {Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {16818-16829}
}
Adversarial Distribution Matching for Diffusion Distillation Towards Efficient Image and Video Synthesis
Abstract
Distribution Matching Distillation (DMD) is a promising score distillation technique that compresses pre-trained teacher diffusion models into efficient one-step or multi-step student generators. Nevertheless, its reliance on reverse Kullback-Leibler (KL) divergence minimization potentially induces mode collapse (or mode-seeking) in certain applications. To circumvent this inherent drawback, we propose Adversarial Distribution Matching (ADM), a novel framework that leverages diffusion-based discriminators to align the latent predictions between real and fake score estimators for score distillation in an adversarial manner. In the context of extremely challenging one-step distillation, we further improve the pre-trained generator by adversarial distillation with hybrid discriminators in both latent and pixel spaces. Unlike the mean squared error used in DMD2 pre-training, our method incorporates a distributional loss on ODE pairs collected from the teacher model, thus providing a better initialization for score distillation fine-tuning in the next stage. By combining the adversarial distillation pre-training with ADM fine-tuning into a unified pipeline termed DMDX, our proposed method achieves superior one-step performance on SDXL compared to DMD2 while consuming less GPU time. Additional experiments that apply multi-step ADM distillation on SD3-Medium, SD3.5-Large, and CogVideoX set a new benchmark towards efficient image and video synthesis.
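The mode-seeking behaviour of reverse KL mentioned in the abstract can be illustrated with a toy one-dimensional example. The sketch below is not ADM or DMD and is not taken from the paper: it assumes a bimodal target density on a grid and fits a single Gaussian by brute-force search, once under forward KL and once under reverse KL. All names (`normal_pdf`, `kl`, `fit_gaussian`) and all numeric settings are hypothetical choices made for illustration.

```python
import numpy as np


def normal_pdf(x, mean, std):
    """Density of a 1-D Gaussian evaluated on the grid x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def kl(a, b, dx):
    """Riemann-sum approximation of KL(a || b) for densities on a uniform grid."""
    eps = 1e-300  # assumption: tiny floor to avoid log(0) where a density underflows
    a = np.clip(a, eps, None)
    b = np.clip(b, eps, None)
    return float(np.sum(a * (np.log(a) - np.log(b))) * dx)


def fit_gaussian(p, x, reverse):
    """Grid-search the Gaussian q that minimises KL(q || p) if reverse, else KL(p || q)."""
    dx = x[1] - x[0]
    best_div, best_mean, best_std = np.inf, None, None
    for mean in np.linspace(-5.0, 5.0, 101):
        for std in np.linspace(0.3, 4.0, 38):
            q = normal_pdf(x, mean, std)
            div = kl(q, p, dx) if reverse else kl(p, q, dx)
            if div < best_div:
                best_div, best_mean, best_std = div, mean, std
    return best_mean, best_std


# Toy target (assumption): equal mixture of two well-separated Gaussians.
x = np.linspace(-12.0, 12.0, 2401)
p = 0.5 * normal_pdf(x, -3.0, 0.7) + 0.5 * normal_pdf(x, 3.0, 0.7)

fwd_mean, fwd_std = fit_gaussian(p, x, reverse=False)  # forward KL(p || q)
rev_mean, rev_std = fit_gaussian(p, x, reverse=True)   # reverse KL(q || p)

print(f"forward KL fit: mean={fwd_mean:.2f}, std={fwd_std:.2f}")
print(f"reverse KL fit: mean={rev_mean:.2f}, std={rev_std:.2f}")
```

In this toy setting the forward-KL fit spreads a wide Gaussian across both modes, whereas the reverse-KL fit locks onto a single mode and ignores the other. This is the kind of collapse that the abstract attributes to reverse KL minimization; it says nothing about how ADM itself is implemented.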
