Figure 1. Visual results of video deraining on a real-world rain video from NTURain.
Figure 2. Example cases of rain. Top: Synthetic, Middle: Real,
Bottom: Generated. Notably, synthetic rain often exhibits unrealistic
repetitive patterns not present in real-world scenes. Rainy
scenes generated by large diffusion models show much more realistic effects.
Figure 4. Comparison between DDIM and DDPM inversion on video data. The video DDIM inversion struggles with
fully reconstructing the video and misses not only high-frequency
details but also larger objects. The PSNR drop of video DDPM
inversion is caused mostly by the VideoVAE (PSNR = 31.80) and
numerical precision.
Figure 6. Results from real-world videos using different deraining approaches.
Figure 8. Ablation study of attention switching. From left to right: Input, without attention switching,
with attention switching. Although both cases are capable of removing the rain well, the base model struggles with
distortions and artifacts, such as missing objects, color shifts, and saturation.
Figure 9. Different rain prompts and their respective results. Left:
Using the diffusion model to generate a video based on the prompt.
Right: Deraining results with different prompts. The generated
video from "rain" shows no rain. When using the mean of rain
prompts, the generated video shows a clear rain pattern and some
background, indicating the prompt is not fully disentangled. "light
rain" shows excellent rain-background disentanglement and performs
best overall for deraining.
Figure 11. Results on GT-Rain. Compared with other baselines, our method generates more temporally consistent deraining.
Figure 13. Ablation study on different selections of blocks B for attention switching. From left to right: Input, No blocks used (Base), Blocks 0-4 and 15-30 (Ours), Blocks 15-30, Blocks 0-4. The proposed approach of using both the initial blocks (0-4) and the later
blocks (15-30) for attention switching obtains the best results. This can be observed from the distortions the other settings introduce
when compared to the input image.
Figure 14. Comparison of video inversion results at different skip values ts. From left to right: Video SDEdit inversion,
Video DDIM inversion, Video DDPM inversion. From top to bottom: ts = 0, ts = 25, ts = 40, ts = 50. Video SDEdit inversion loses the structure of the entire scene
at ts = 0, while video DDIM inversion is able to retain some structure from the cars and the camera motion. Video DDPM inversion
retains the scene with only a minor loss in high-frequency details. For higher skip values ts, the results improve. The PSNR is computed
per frame and averaged over all frames.
Figure 15. Selected frames from desnowed real-world videos. Base refers to the case without attention switching.
Figure 16. Visualization of different snow prompts. Left: The result generated with the prompt "snow" produces snow on the ground
instead of a falling snow effect. Middle: When using the prompt "snowing", the model generates falling snow. Right: Unlike the results in
Fig. 5 for the prompt "light", the same prompt affects snow generation differently. Note that the generated snow is entangled with a snowy
forest background.