Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards
Abstract
Diffusion models have achieved remarkable success in text-to-image generation. However, their practical applications are hindered by the misalignment between generated images and the corresponding text prompts. To tackle this issue, reinforcement learning (RL) has been considered for diffusion model fine-tuning. Yet, RL's effectiveness is limited by the challenge of sparse reward, where feedback is only available at the end of the generation process. This makes it difficult to identify which actions during the denoising process contribute positively to the final generated image, potentially leading to ineffective or unnecessary denoising policies. To this end, this paper presents a novel RL-based framework that addresses the sparse reward problem when training diffusion models. Our framework, named B²-DiffuRL, employs two strategies: **B**ackward progressive training and **B**ranch-based sampling. First, backward progressive training initially focuses on the final timesteps of the denoising process and gradually extends the training interval to earlier timesteps, easing the learning difficulty associated with sparse rewards. Second, we perform branch-based sampling for each training interval. By comparing samples within the same branch, we can identify how much the policies of the current training interval contribute to the final image, which helps to learn effective policies rather than unnecessary ones. B²-DiffuRL is compatible with existing optimization algorithms. Extensive experiments demonstrate the effectiveness of B²-DiffuRL in improving prompt-image alignment and maintaining diversity in generated images. The code for this work is available.
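As a purely illustrative sketch of the two strategies described in the abstract, the snippet below schedules a training interval that grows backward from the final denoising steps and rolls out several branches from a shared intermediate state, so that rewards can be compared within the same branch. All names here (`denoise_step`, `reward`, `sample_trajectory`, `train_b2_diffurl`, `T`) are hypothetical toy stand-ins, and the simple perturbation-based update is a placeholder for whatever policy-gradient optimizer the paper actually pairs with B²-DiffuRL; this is not the authors' implementation.

```python
import random

T = 50  # total number of denoising timesteps (toy value)

def denoise_step(x, t, theta):
    """Hypothetical one-step denoising policy: nudges x according to theta."""
    return x + theta * (1.0 / T) + random.gauss(0, 0.01)

def reward(x_final):
    """Hypothetical sparse reward: only the final image is scored."""
    return -abs(x_final - 1.0)

def sample_trajectory(theta, start_t, x_start):
    """Roll out the remaining denoising steps from timestep start_t."""
    x = x_start
    for t in range(start_t, T):
        x = denoise_step(x, t, theta)
    return x

def train_b2_diffurl(theta, n_stages=5, branches=4, lr=0.1):
    """Sketch of backward progressive training with branch-based sampling.

    The training interval starts at the last timesteps and is gradually
    extended toward earlier ones; within each stage, several branches are
    rolled out from a shared prefix state so that reward differences can be
    attributed to the current training interval.
    """
    for stage in range(1, n_stages + 1):
        # Training interval covers the last (stage / n_stages) fraction of steps.
        interval_start = T - (T * stage) // n_stages
        for _ in range(20):  # a few updates per stage
            # Shared prefix: denoise from noise down to the interval start.
            x = random.gauss(0, 1)
            for t in range(0, interval_start):
                x = denoise_step(x, t, theta)
            # Branch-based sampling: perturb the policy per branch and compare
            # rewards among branches sharing the same prefix state x.
            candidates = [theta + random.gauss(0, 0.05) for _ in range(branches)]
            rewards = [reward(sample_trajectory(c, interval_start, x)) for c in candidates]
            baseline = sum(rewards) / len(rewards)
            # Move toward branches with higher within-branch (baseline-subtracted) reward.
            theta += lr * sum((r - baseline) * (c - theta) for r, c in zip(rewards, candidates))
    return theta

print(train_b2_diffurl(theta=0.0))
```

The baseline subtraction inside each branch is what isolates the contribution of the current interval: branches share the same prefix, so only actions taken within the interval explain their reward differences.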
Related Material

[pdf] [supp] [arXiv] [bibtex]

@InProceedings{Hu_2025_CVPR,
    author    = {Hu, Zijing and Zhang, Fengda and Chen, Long and Kuang, Kun and Li, Jiahui and Gao, Kaifeng and Xiao, Jun and Wang, Xin and Zhu, Wenwu},
    title     = {Towards Better Alignment: Training Diffusion Models with Reinforcement Learning Against Sparse Rewards},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {23604-23614}
}