Zero-Shot Video Deraining with Video Diffusion Models

WACV 2026

Flawless AI

Figure 1. Visual results of video deraining on a real-world rain video from NTURain.

Figure 2. Example cases of rain. Top: Synthetic, Middle: Real, Bottom: Generated. Notably, synthetic rain often exhibits unrealistic repetitive patterns not present in real-world scenes. Rainy scenes generated by large diffusion models exhibit much more realistic rain effects.

Figure 4. Comparison between DDIM and DDPM inversion on video data. Video DDIM inversion struggles to fully reconstruct the video and misses not only high-frequency details but also larger objects. The PSNR drop of video DDPM inversion is caused mostly by the VideoVAE (PSNR = 31.80) and numerical precision.

Figure 6. Results from real-world videos using different deraining approaches.

Figure 8. Ablation study of attention switching. From left to right: Input, without attention switching, with attention switching. Although both settings remove the rain well, the base model struggles with distortions and artifacts, such as missing objects, color shifts, and saturation.

Figure 9. Different rain prompts and their respective results. Left: Using the diffusion model to generate a video based on the prompt. Right: Deraining results with different prompts. The video generated from “rain” shows no rain. When using the mean of rain prompts, the generated video shows a clear rain pattern but also some background, indicating the prompt is not fully disentangled. “light rain” shows excellent rain-background disentanglement, and overall it performs best for deraining.

Figure 10. Additional real-world deraining results.

Figure 11. Results on GT-Rain. Compared with other baselines, our method generates more temporally consistent deraining.

Figure 13. Ablation study on different selections of blocks B for attention switching. From left to right: Input, no blocks used (Base), Blocks 0-4 and 15-30 (Ours), Blocks 15-30, Blocks 0-4. The proposed approach of combining the initial blocks (0-4) with the later blocks (15-30) for attention switching obtains the best results. This can be observed by comparing the distortions introduced by the other settings against the input image.
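The block selection described in the caption can be sketched as a simple membership test. This is a minimal, hypothetical illustration (function and variable names are ours, not the paper's), assuming each denoiser block is indexed and that blocks in B reuse recorded attention while the rest recompute it:

```python
# Hypothetical sketch of the block set B from Figure 13:
# the initial blocks (0-4) together with the later blocks (15-30).
ATTN_SWITCH_BLOCKS = set(range(0, 5)) | set(range(15, 31))

def use_switched_attention(block_idx: int) -> bool:
    # A block in B would reuse the self-attention recorded during
    # inversion instead of recomputing it during denoising.
    return block_idx in ATTN_SWITCH_BLOCKS
```

The ablation's other settings correspond to choosing only `range(0, 5)`, only `range(15, 31)`, or the empty set (Base).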

Figure 14. Comparison of video inversion results at different skip values ts. From left to right: Video SDEdit inversion, Video DDIM inversion, Video DDPM inversion. From top to bottom: ts = 0, ts = 25, ts = 40, ts = 50. Video SDEdit inversion loses the structure of the entire scene at ts = 0, while video DDIM inversion is able to retain some structure from the cars and the camera motion. Video DDPM inversion retains the scene with only a minor loss in high-frequency details. For higher skip values ts, the results improve. The PSNR is computed per frame and averaged across frames.

Figure 15. Selected frames from desnowed real-world videos. Base refers to the case without attention switching.

Figure 16. Visualization of different snow prompts. Left: The result generated with the prompt “snow” produces snow on the ground instead of a falling snow effect. Middle: Using the prompt “snowing”, the model generates falling snow. Right: Unlike the results in Fig. 5 for the prompt “light”, the same prompt affects snow generation differently. Note that the generated result is entangled with a snowy forest.