Dynamic Typography: Bringing Text to Life via Video Diffusion Prior


🎡 We recommend watching the video with sound on 🎡

Abstract

Text animation serves as an expressive medium, transforming static communication into dynamic experiences by infusing words with motion to evoke emotions, emphasize meanings, and construct compelling narratives. Crafting semantically aware animations is challenging and demands expertise in graphic design and animation. We present an automated text animation scheme, termed "Dynamic Typography", which deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts. The animation is represented by a canonical field that aggregates the semantic content and a deformation field that applies per-frame motion to deform the canonical shape. The two fields are jointly optimized using priors from a large pretrained text-to-video diffusion model via a score-distillation loss with designed regularization, encouraging the video to cohere with the intended textual concept while maintaining legibility and structural integrity throughout the animation. We demonstrate the generalizability of our approach across various text-to-video models and highlight its superiority over baselines. Through quantitative and qualitative evaluations, we show that our framework generates coherent text animations that faithfully interpret user prompts while maintaining readability.

Gallery

How does it work?

The original input letter is initialized as a set of connected cubic Bézier curves, represented by a set of control points. Our method predicts a displacement for each control point at each frame.
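To make the representation concrete, here is a minimal sketch of a letter outline stored as cubic Bézier control points, with one displacement per control point per frame. The function names (`cubic_bezier`, `displace`, `animate`) and the hand-written offsets are illustrative assumptions; in the method itself the displacements are predicted, not supplied by hand.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def cubic_bezier(p0: Point, p1: Point, p2: Point, p3: Point, t: float) -> Point:
    """Evaluate one cubic Bezier segment at t in [0, 1] using the Bernstein form."""
    s = 1.0 - t
    b0, b1, b2, b3 = s ** 3, 3 * s * s * t, 3 * s * t * t, t ** 3
    x = b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0]
    y = b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
    return (x, y)


def displace(points: List[Point], offsets: List[Point]) -> List[Point]:
    """Add one (dx, dy) offset to each control point."""
    return [(px + dx, py + dy) for (px, py), (dx, dy) in zip(points, offsets)]


def animate(points: List[Point], per_frame_offsets: List[List[Point]]) -> List[List[Point]]:
    """Return the displaced control points for every frame."""
    return [displace(points, offsets) for offsets in per_frame_offsets]


# One segment of a hypothetical outline and two frames of hand-written offsets.
segment = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
frames = animate(segment, [
    [(0.0, 0.0)] * 4,   # frame 0: no motion
    [(0.5, 0.0)] * 4,   # frame 1: shift right by 0.5
])
midpoint = cubic_bezier(*segment, 0.5)
```

Because the curve is fully determined by its control points, moving the points is enough to move the whole outline while keeping it a smooth vector shape.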
An overview of the framework. We represent the animation as a Canonical Field that aggregates the semantic content and a Deformation Field that applies per-frame motion to deform the Canonical Shape, both implemented as coordinate-based MLPs. These fields are jointly optimized by the video prior $L_{SDS}$ from a frozen pre-trained video foundation model using Score Distillation Sampling, under regularization on legibility ($L_{legibility}$) and structure preservation ($L_{structure}$).
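The composition of the two fields can be sketched as follows. This is only an illustration under stated assumptions: the tiny bias-free MLPs, their sizes, and the random weights are hypothetical stand-ins for the optimized networks, and the Score Distillation Sampling optimization, the renderer, and both regularizers are omitted entirely.

```python
import math
import random
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def make_mlp(in_dim: int, hidden: int, out_dim: int, seed: int) -> Callable[[List[float]], List[float]]:
    """Build a tiny bias-free two-layer MLP with fixed random weights (untrained stand-in)."""
    rng = random.Random(seed)
    w1 = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(hidden)]
    w2 = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(out_dim)]

    def forward(x: List[float]) -> List[float]:
        h = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w1]
        return [sum(w * v for w, v in zip(row, h)) for row in w2]

    return forward


canonical_mlp = make_mlp(2, 16, 2, seed=0)     # (x, y)    -> offset to the canonical shape
deformation_mlp = make_mlp(3, 16, 2, seed=1)   # (x, y, t) -> per-frame offset


def canonical_shape(base_points: List[Point]) -> List[Point]:
    """Move the original control points to the canonical shape."""
    out = []
    for x, y in base_points:
        dx, dy = canonical_mlp([x, y])
        out.append((x + dx, y + dy))
    return out


def frame_shape(canonical_points: List[Point], t: float) -> List[Point]:
    """Deform the canonical shape for normalized time t in [0, 1]."""
    out = []
    for x, y in canonical_points:
        dx, dy = deformation_mlp([x, y, t])
        out.append((x + dx, y + dy))
    return out


def render_animation(base_points: List[Point], num_frames: int):
    """Return the canonical control points and the control points of every frame."""
    canon = canonical_shape(base_points)
    times = [k / max(num_frames - 1, 1) for k in range(num_frames)]
    return canon, [frame_shape(canon, t) for t in times]


canon, frames = render_animation([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)], num_frames=4)
```

The point of the split is that the canonical field is shared by all frames, while only the deformation field depends on time.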

Comparison

We compare our method with three baseline models: two pixel-based models (the text-to-video model Gen-2 and the image-to-video model DynamiCrafter) and one vector-based animation model (LiveSketch). For text-to-video generation, we append to the prompt the phrase “which looks like a letter §,” where § represents the specific letter to be animated. In the image-to-video case, we use the stylized letter generated by Word-As-Image as the conditioning image. In the vector-based scenario, we use LiveSketch as the framework to animate vector images. To ensure a fair comparison, we condition this animation on the stylized letter generated by Word-As-Image as well.
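The prompt modification used for the text-to-video baseline can be sketched as a small helper. The function name `build_t2v_prompt`, the comma used to join the two parts, and the example letter are assumptions made for illustration; the text above only specifies the appended phrase.

```python
SUFFIX_TEMPLATE = "which looks like a letter {letter}"


def build_t2v_prompt(prompt: str, letter: str) -> str:
    """Append the letter hint to a prompt (the joining comma is an assumption)."""
    base = prompt.rstrip(" .")
    return f"{base}, {SUFFIX_TEMPLATE.format(letter=letter)}"


# Hypothetical usage: the letter "M" is chosen here only as an example.
example = build_t2v_prompt("A camel walks steadily across the desert", "M")
```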
Video comparison (columns: Input | Gen-2 | DynamiCrafter | LiveSketch | Ours), one row per prompt:
A camel walks steadily across the desert
A man doing exercise by lifting two dumbbells in both hands
Two people kiss each other, one holding the others chin with his hand
A fat swan is swimming elegantly and stretching its neck on the water
A hand holding a monocular telescope turns towards the camera

Ablation Study

We conduct an ablation study to analyze the main components of our model. Without the perceptual (legibility) regularization, the canonical shape deviates significantly from the original letter's shape and struggles to preserve legibility. Removing the learnable canonical shape or the mesh-based structure preservation leads to abrupt appearance changes between adjacent frames and severe artifacts.
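One way to read these ablations is as changes to a single training objective built from the video prior and the two regularizers. The sketch below assumes a plain weighted sum; the weights $\lambda_{leg}$ and $\lambda_{str}$ are hypothetical symbols introduced here, and this page does not state the exact form or values.

```latex
% Assumed overall objective: SDS video prior plus two weighted regularizers.
% \lambda_{leg} and \lambda_{str} are illustrative symbols, not values from the source.
L_{total} = L_{SDS} + \lambda_{leg}\, L_{legibility} + \lambda_{str}\, L_{structure}
```

Under this reading, removing legibility regularization would correspond to $\lambda_{leg} = 0$ and removing structure preservation to $\lambda_{str} = 0$, whereas removing the learnable canonical shape changes the representation itself and not the loss.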

Canonical Shape

Input | With Canonical Shape | Without Canonical Shape
A bullfighter holds the corners of a red cape in both hands and waves it
A camel walks steadily across the desert
A fat swan is swimming elegantly and stretching its neck on the water

Legibility Regularization

Input | With Legibility Regularization | Without Legibility Regularization
A bullfighter holds the corners of a red cape in both hands and waves it
A man doing exercise by lifting two dumbbells in both hands
A large and a small hand together make a heart shape

Structure Preservation Regularization

Input | With Structure Preservation Regularization | Without Structure Preservation Regularization
A bullfighter holds the corners of a red cape in both hands and waves it
A butterfly is flying sideways and waves its two wings
A knight draws his sword, pointing it forward, ready for battle
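A mesh-based structure term can be illustrated with a penalty on how much triangle angles change between the canonical shape and a deformed frame. This is a sketch under assumptions: the angle-based formulation, the function names, and the hand-written triangle list are illustrative, and the triangulation of the control points (for example a Delaunay triangulation) is assumed to be computed elsewhere. It is not presented as the exact loss used by the method.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Triangle = Tuple[int, int, int]


def corner_angle(p: Point, q: Point, r: Point) -> float:
    """Interior angle at p (radians) between the edges p->q and p->r."""
    v1 = (q[0] - p[0], q[1] - p[1])
    v2 = (r[0] - p[0], r[1] - p[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    cross = v1[0] * v2[1] - v1[1] * v2[0]
    return math.atan2(abs(cross), dot)


def triangle_angles(a: Point, b: Point, c: Point) -> Tuple[float, float, float]:
    """Interior angles at the three vertices of a triangle."""
    return (corner_angle(a, b, c), corner_angle(b, c, a), corner_angle(c, a, b))


def structure_penalty(canonical: Sequence[Point], deformed: Sequence[Point],
                      triangles: List[Triangle]) -> float:
    """Mean squared change of triangle angles between two point sets."""
    total, count = 0.0, 0
    for i, j, k in triangles:
        ref = triangle_angles(canonical[i], canonical[j], canonical[k])
        cur = triangle_angles(deformed[i], deformed[j], deformed[k])
        total += sum((r - c) ** 2 for r, c in zip(ref, cur))
        count += 3
    return total / count if count else 0.0


square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
mesh = [(0, 1, 2), (0, 2, 3)]                      # hand-written triangulation
moved = [(x + 2.0, y + 3.0) for x, y in square]    # rigid translation
sheared = [(x + y, y) for x, y in square]          # shear distorts the angles
```

Translations, rotations, and uniform scaling leave the angles unchanged, so such a penalty allows global motion while discouraging local distortions of the letter's structure.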

Generalizability

Generalizability over different text-to-video models

Input | Models | Animations
A knight draws his sword, pointing it forward, ready for battle
ModelScope
AnimateDiff
ZeroScope

Generalizability over different fonts

Prompt | Fonts | Animations
Two men shaking hands with each other in a friendly manner
KaushanScript-Regular
Segoe Print
Roboto-Bold

Generalizability over different prompts

Input letter | Prompts | Animations
A couple is walking hand in hand, with the girl following the boy
A couple is walking hand in hand. The video shows their whole bodies
A family walks together. Father and mother hold their child's hand

Generalizability over different languages

Input letter | Prompts | Animations
A girl is walking, following a boy
A hooked fish swims vigorously, trying to break free

Appendix

Ablation: annealed frequency encoding

Input letter | Prompts | With freq. | Without freq.
A bullfighter holds the corners of a red cape in both hands and waves it
A butterfly is flying sideways and waves its two wings
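For readers unfamiliar with annealed frequency encoding, the sketch below shows a common coarse-to-fine variant: sinusoidal features whose higher-frequency bands are faded in as optimization progresses. The cosine window, the function names, and the `progress` parameter are assumptions for illustration; this page does not give the exact schedule used.

```python
import math
from typing import List


def band_weight(alpha: float, band: int) -> float:
    """Cosine window: 0 before a band is enabled, rising smoothly to 1."""
    x = min(max(alpha - band, 0.0), 1.0)
    return 0.5 * (1.0 - math.cos(math.pi * x))


def annealed_encoding(value: float, num_bands: int, progress: float) -> List[float]:
    """Encode a scalar coordinate with frequency bands faded in by progress in [0, 1]."""
    alpha = progress * num_bands            # how many bands are (partly) enabled
    feats = [value]                         # the raw coordinate is always kept
    for band in range(num_bands):
        w = band_weight(alpha, band)
        freq = (2.0 ** band) * math.pi
        feats.append(w * math.sin(freq * value))
        feats.append(w * math.cos(freq * value))
    return feats


early = annealed_encoding(0.3, num_bands=4, progress=0.0)   # only the raw coordinate is active
late = annealed_encoding(0.3, num_bands=4, progress=1.0)    # all bands fully active
```

Starting with low frequencies biases the fields toward smooth, coarse motion first and lets finer detail appear later.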

Effect analysis of the control points

Input | Number of control points | Canonical Shape | Animations
A police officer holds a pistol, rushing around and aiming to shoot the criminal
75
204
420
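The page reports results for different control point counts but does not say how the count is changed. One standard way to add control points without altering the outline is de Casteljau subdivision, sketched below; the function names are illustrative and using this technique here is an assumption, not a description of the authors' pipeline.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Segment = Tuple[Point, Point, Point, Point]


def lerp(p: Point, q: Point, t: float) -> Point:
    """Linear interpolation between two points."""
    return (p[0] + (q[0] - p[0]) * t, p[1] + (q[1] - p[1]) * t)


def split_cubic(p0: Point, p1: Point, p2: Point, p3: Point,
                t: float = 0.5) -> Tuple[Segment, Segment]:
    """Split one cubic Bezier segment at t into two segments tracing the same curve."""
    a, b, c = lerp(p0, p1, t), lerp(p1, p2, t), lerp(p2, p3, t)
    d, e = lerp(a, b, t), lerp(b, c, t)
    m = lerp(d, e, t)                       # the point on the curve at parameter t
    return (p0, a, d, m), (m, e, c, p3)


def subdivide_path(segments: List[Segment]) -> List[Segment]:
    """Split every segment once, doubling the number of segments."""
    out: List[Segment] = []
    for seg in segments:
        left, right = split_cubic(*seg)
        out.extend([left, right])
    return out


def open_chain_point_count(segments: List[Segment]) -> int:
    """Control points in an open chain where neighbours share an endpoint."""
    return 3 * len(segments) + 1 if segments else 0


path = [((0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0))]
finer = subdivide_path(path)
```

More control points give the optimization more freedom to reshape the letter, at the cost of more parameters to keep coherent.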

Failure Case

Input | Canonical Shape | Animations
The rocket has been launched, soaring into the sky

GPT-4V as Dynamic Typography Designer

Input letter | Prompts | Animations
A cat curls up to sleep
A soccer player winding up for a powerful kick
A snake slithers through grass