LATENTMAN: Generating Consistent Animated Characters using Image Diffusion Models

Abdelrahman Eldesokey, Peter Wonka; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 7510-7519

Abstract


We propose a zero-shot approach for generating consistent videos of animated characters based on Text-to-Image (T2I) diffusion models. Existing Text-to-Video (T2V) methods are expensive to train and require large-scale video datasets to produce diverse characters and motions. At the same time, their zero-shot alternatives fail to produce temporally consistent videos with continuous motion. We strive to bridge this gap by introducing LatentMan, which leverages existing text-based motion diffusion models to generate diverse, continuous motions that guide the T2I model. To boost temporal consistency, we introduce a Spatial Latent Alignment module that computes cross-frame dense correspondences and uses them to align the latents of the video frames. Furthermore, we propose Pixel-Wise Guidance to steer the diffusion process in a direction that minimizes visual discrepancies between frames. Our proposed approach outperforms existing zero-shot T2V approaches in generating videos of animated characters in terms of pixel-wise consistency and user preference.
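As a rough illustration of the two components described above, the sketch below shows (i) warping the previous frame's latent with a dense correspondence field and blending it into the current latent, which is the idea behind Spatial Latent Alignment, and (ii) a generic gradient-guidance step that nudges the latent toward lower pixel-wise discrepancy with the previous frame, which is the idea behind Pixel-Wise Guidance. This is not the authors' implementation; the tensor shapes, the normalized flow format, the blend and scale parameters, and the stand-in decoder are all assumptions made for illustration.

import torch
import torch.nn.functional as F


def spatial_latent_alignment(current_latent, prev_latent, flow, blend=0.5):
    # Warp the previous frame's latent with a dense correspondence field
    # (flow: normalized (dx, dy) offsets of shape (B, 2, H, W), an assumed
    # format) and blend the result into the current latent.
    b, _, h, w = prev_latent.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1.0, 1.0, h), torch.linspace(-1.0, 1.0, w), indexing="ij"
    )
    base_grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    grid = base_grid + flow.permute(0, 2, 3, 1)
    warped = F.grid_sample(prev_latent, grid, align_corners=True)
    return blend * warped + (1.0 - blend) * current_latent


def pixel_wise_guidance(latent, prev_frame, decode, scale=0.1):
    # Generic gradient guidance: decode the latent, measure the pixel-wise
    # discrepancy to the previous frame, and step the latent against the
    # gradient of that discrepancy.
    latent = latent.detach().requires_grad_(True)
    frame = decode(latent)
    loss = F.mse_loss(frame, prev_frame)
    grad = torch.autograd.grad(loss, latent)[0]
    return (latent - scale * grad).detach()


if __name__ == "__main__":
    z_cur = torch.randn(1, 4, 64, 64)   # current frame's latent
    z_prev = torch.randn(1, 4, 64, 64)  # previous frame's latent
    flow = torch.zeros(1, 2, 64, 64)    # identity correspondence (no motion)
    decode = lambda z: z                # stand-in for a real VAE decoder

    aligned = spatial_latent_alignment(z_cur, z_prev, flow)
    guided = pixel_wise_guidance(aligned, decode(z_prev), decode)
    print(aligned.shape, guided.shape)

In a real pipeline, decode would be the T2I model's VAE decoder and flow would come from the dense correspondences mentioned in the abstract; the identity flow and identity decoder above are placeholders that keep the sketch self-contained.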

Related Material


[bibtex]
@InProceedings{Eldesokey_2024_CVPR,
    author    = {Eldesokey, Abdelrahman and Wonka, Peter},
    title     = {LATENTMAN: Generating Consistent Animated Characters using Image Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {7510-7519}
}