@InProceedings{Zhang_2026_CVPR,
    author    = {Zhang, Zechuan and Chen, Zhenyuan and Yang, Zongxin and Yang, Yi},
    title     = {Are Image-to-Video Models Good Zero-Shot Image Editors?},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {2090-2103}
}
Are Image-to-Video Models Good Zero-Shot Image Editors?
Abstract
Large-scale video diffusion models exhibit strong world-simulation and temporal reasoning capabilities, yet their potential as zero-shot image editors remains underexplored. We present IF-Edit (Image Edit by Generating Frames), a tuning-free framework that repurposes pre-trained image-to-video diffusion models for instruction-driven image editing. IF-Edit addresses three core obstacles (prompt misalignment, redundant temporal latents, and blurry late-stage frames) via: (1) a Chain-of-Thought Prompt Enhancement module that reformulates static editing instructions into temporally grounded reasoning prompts; (2) a Temporal Latent Dropout strategy that compresses frame latents after the expert-switch point, accelerating denoising while preserving global semantics and temporal coherence; and (3) a Self-Consistent Post-Refinement step that refines the sharpest late-stage frame through a brief still-video trajectory, leveraging the video prior for sharper and more faithful results. Extensive experiments across four public benchmarks, covering non-rigid deformations, physical and temporal reasoning, and general instruction editing, show that IF-Edit achieves strong performance on non-rigid and reasoning-centric tasks while remaining competitive on general-purpose edits. Our study offers a systematic view of video diffusion models as image editors, revealing their unique strengths, limitations, and a simple recipe for unified video-image generative reasoning.