Vector Prism: Animating Vector Graphics by Stratifying Semantic Structure

Anonymous Authors
CVPR 2026 (Review Version)
I want the emoji to look to the left and right.
I want the elements smoothly pop up in a lively manner.
I want the compass needle to quickly spin around once.
I want the buttons to bounce in one by one.
I want the scene to turn to night.
I want the python logo bounce in and meet in the middle.
I want the pytorch logo to bounce and change colors.
I want doughnut to bounce up and sprinkles fall in

Figure 1: Animations generated by Vector Prism. The results are shown here as MP4 videos, but the original outputs are CSS animations.

Abstract

Scalable Vector Graphics (SVG) are central to modern web design, and the demand to animate them continues to grow as web environments become increasingly dynamic. Yet automating the animation of vector graphics remains challenging for vision–language models (VLMs) despite recent progress in code generation and motion planning. VLMs routinely mishandle SVGs, since visually coherent parts are often fragmented into low-level shapes that offer little guidance on which elements should move together. In this paper, we introduce a framework that recovers the semantic structure required for reliable SVG animation and reveals the missing layer that current VLM systems overlook. This is achieved through a statistical aggregation of multiple weak part predictions, allowing the system to stably infer semantics from noisy predictions. By reorganizing SVGs into semantic groups, our approach enables VLMs to produce animations with far greater coherence. Our experiments demonstrate substantial gains over existing approaches, suggesting that semantic recovery is the key step that unlocks robust SVG animation and supports more interpretable interactions between VLMs and vector graphics.

1. Introduction

Scalable Vector Graphics (SVG) has become increasingly central to modern web experiences, prized for its portability across devices and infinite scalability without quality loss. This popularity is driven by its vector-based design, which describes graphics through geometric primitives rather than pixels, resulting in compact and resolution-independent files. As modern web interfaces evolve toward dynamic and interactive experiences, expressive animation techniques have become essential, since SVG animations can deliver rich visual motion where videos would be prohibitively heavy for web delivery.

Recent advances in vision-language models (VLMs) offer a tempting possibility: generating animations simply by giving a VLM the SVG file and an instruction. At first glance, this seems straightforward, since modern VLMs can already plan animation sequences and generate code. In practice, VLM-generated SVG animations rarely succeed and are often visually broken. The problem lies not in planning or coding capabilities but in how SVGs are structured: they are optimized for rendering efficiency rather than semantic clarity. For example, as seen in Figure 2, visually coherent elements (e.g., bunny ears and nose) are often fragmented or grouped by draw order, obscuring the higher-level semantics needed for animation.

SVG Structure Comparison
Figure 2: Unstructured SVG contains fragmented elements and unclear tags, while structured SVG organizes parts with descriptive tags, ensuring alignment between SVG syntax and user instructions.

In this paper, we address the overlooked step of restructuring SVGs so that vision-language models can reason about meaningful parts during animation. Our aim is to reveal an internal structure for SVGs that allows a model to reference semantic units and attach motion to the correct ones. The native SVG hierarchy rarely provides this structure, which motivates a method that can reliably recover the semantics required for animation.

We introduce Vector Prism, a framework that performs this recovery by stratifying noisy visual cues into coherent semantic groups, much like a prism for vector graphics. Each SVG primitive (i.e., a basic shape) is rendered through several focused views (e.g., highlighting, isolation, zoom-in, outlining, and bounding boxes), and a VLM predicts its semantic meaning from each view, producing a set of weak, tentative semantic labels. Instead of aggregating these predictions by simple majority voting, Vector Prism treats them as observations in a statistical inference process. Specifically, our method analyzes agreement patterns across weak labels and infers the underlying semantic signal with high stability. A Bayes decision rule then selects labels that minimize expected classification error and recover the most plausible true part structure.

These refined labels form the basis for the final stage, where Vector Prism restructures SVG primitives into coherent, animation-ready hierarchies. This restructuring bridges the gap between the visual semantics of the artwork and the syntactic organization of the SVG file, aligning the representation with how VLMs perceive and manipulate visual concepts. As a result, VLMs can animate graphics at the level of meaningful parts rather than low-level shapes, producing motions that are both visually stable and semantically consistent.

Our contributions are threefold:

Method Overview
Figure 3: (a) Animation pipeline overview. We first create a detailed animation plan, then create the animation code for the structured SVG. (b) Vector Prism overview. We collect agreement patterns among the VLM responses obtained from different rendering methods.

2. Related Work

Vector Graphics Animation

One line of work generates or animates vector graphics by optimizing vector (or animation) parameters using gradients from pre-trained image/video diffusion priors, typically via score distillation sampling (SDS). Since the SDS objective acts on rasterized renderings rather than vector structure, it encourages appearance-preserving changes and resists the large part rearrangements that animation often needs. Without explicit temporal regularization, the optimization often settles into short repetitive motions with visible jitter.

Another active stream fine-tunes LLMs to directly produce vector graphics parameters or animation commands, enabled by large paired datasets of vector graphics and human instructions. Because LLMs carry little understanding of vector geometry and scene hierarchies, performance scales primarily with data, often requiring millions of examples. Orthogonal to data scaling, we focus on recovering element-level semantics in the input SVGs, so that downstream LLMs/VLMs can robustly plan motions and generalize to diverse, in-the-wild graphics.

Semantic Understanding of Vector Graphics

Raw symbolic representations of vector graphics (e.g., shape coordinates and translation matrices) are designed for rendering and programmatic manipulation rather than human reading or editing, which makes them inherently difficult for humans to directly inspect and understand. This limitation has been highlighted as the research community seeks to teach LLMs, which often rely on perceptual cues similar to humans, to understand and edit vectorized formats.

Although VLMs tend to understand rasterized renderings of simple and well-separated vector graphics, we find that they quickly fail to understand individual parts of complex real-world cases. In this paper, we take a significant step in vector graphics understanding by not only targeting complex real-world SVGs but also identifying and labeling individual SVG primitives, which is exactly the capability required for animation. To do so, we present a statistical inference framework that turns unreliable and noisy VLM outputs into robust decisions, making animation possible even without fine-tuning VLMs.

3. Method

3.1 The Overview

As illustrated in Figure 3, the pipeline begins with animation planning (§3.2), where a vision–language model (VLM) interprets the visual content and generates a detailed scheme of how each semantic component should be animated. It then proceeds to semantic wrangling (§3.3), where the SVG is restructured into a semantically meaningful and animatable form through statistical inference, and finally to animation generation (§3.4), which produces executable animation code.

The planning stage provides semantic understanding of the scene, while the animation stage operates directly on the SVG code. Our restructuring stage bridges this gap by injecting semantic meaning into the SVG, enriching its structure with interpretable tags that connect visual reasoning to code-level representation. The core contribution of our approach lies in this stage, where we introduce a statistical inference framework that makes reliable semantic inferences from inherently noisy model predictions.

3.2 Animation Planning

The planning stage uses a VLM to reason about the scene at a semantic level. The SVG is first rendered into a raster image so it can be understood by the VLM, which offers stronger visual signals than the original SVG code. The VLM is then instructed to produce high-level animation plans given the rendered image and the user's animation description, identifying which semantic components should move and how they relate to one another. For example, when prompted to "make the sun rise," the VLM identifies the circular yellow region as the sun and the blue background as the sky, proposing that the sun should move upward while the sky gradually brightens.

Since VLMs lack an understanding of the symbolic structures (i.e., SVG syntax), they have no way to directly implement those plans into the SVG's syntactic hierarchy. Bridging this semantic–syntactic divide is precisely the role of the restructuring stage.

3.3 Vector Prism

Problem Setup and Notations

Given an SVG file, let $\mathcal{X}$ be the set of all primitives, which are basic shapes such as <path>, <rect>, <circle>, <ellipse>, <line>, <polyline>, and <polygon>. Every primitive should fall into one of the semantic categories $\mathcal{Y}=\{1,\dots,k\}$ fixed from the planning stage. For each primitive $\mathbf{x}\in\mathcal{X}$ there is an unknown true label $y(\mathbf{x})\in\mathcal{Y}$ that we want to predict.

For an SVG primitive to be visually interpreted by a VLM, it first needs to be rendered into a raster image. Deciding how to render $\mathbf{x}$ is non-trivial, and thus we use $M$ different rendering methods indexed by $i\in\{1,\dots,M\}$. This provides complementary views of the target primitive, which lets us collect several weak labels for the same primitive. Examples include a highlight on the original canvas, a tight bounding-box overlay, a zoomed crop, and isolation on a blank background. When we use method $i$ to render a primitive $\mathbf{x}$, the VLM returns a label $s_i(\mathbf{x})\in\mathcal{Y}$.
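As a rough illustration of how two of these views could be constructed, the sketch below builds an "isolation" and a "highlight" variant of an SVG for one target primitive. The function name `make_views`, the `dim_opacity` value, and the assumption that the primitives are direct children of the root are all hypothetical; rasterizing the resulting SVGs for the VLM would need an external renderer and is left out.

```python
import copy
import xml.etree.ElementTree as ET


def make_views(root, target_index, dim_opacity="0.15"):
    """Return {view_name: svg_root} for one target primitive.

    Assumes `root` is a flattened <svg> element whose direct children
    are the primitives, listed in paint order (an illustrative assumption).
    """
    # Isolation view: only the target primitive on an otherwise empty canvas.
    isolated = ET.Element(root.tag, dict(root.attrib))
    isolated.append(copy.deepcopy(root[target_index]))

    # Highlight view: keep every primitive but fade the non-target ones.
    highlighted = copy.deepcopy(root)
    for i, child in enumerate(highlighted):
        if i != target_index:
            child.set("opacity", dim_opacity)

    return {"isolation": isolated, "highlight": highlighted}


# Toy scene: a sky rectangle and a sun circle.
svg = ET.Element("svg", {"viewBox": "0 0 100 100"})
ET.SubElement(svg, "rect", {"width": "100", "height": "100", "fill": "skyblue"})
ET.SubElement(svg, "circle", {"cx": "50", "cy": "50", "r": "20", "fill": "gold"})

views = make_views(svg, target_index=1)
print(ET.tostring(views["isolation"], encoding="unicode"))
```

Each view would then be rendered and sent to the VLM separately, yielding one weak label $s_i(\mathbf{x})$ per view.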

We assume a Dawid-Skene model for each rendering method:

$$\Pr[s_i=\ell]= \begin{cases} p_i, & \ell=y,\\ \frac{1-p_i}{k-1}, & \ell\neq y. \end{cases}$$

where rendering method $i$ has accuracy $p_i$ and, when wrong, errs uniformly over the other $k-1$ labels. We will recover the reliability $p_i$ of each rendering method.
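To make the noise model concrete, here is a minimal sampler that draws one weak label under these assumptions. The function name `sample_weak_label` is hypothetical, and labels are indexed from 0 to $k-1$ rather than 1 to $k$ for convenience.

```python
import random


def sample_weak_label(y, p, k, rng):
    """Draw one weak label under the symmetric noise model.

    y: true label in range(k); p: accuracy of the rendering method;
    rng: a random.Random instance.
    """
    if rng.random() < p:
        return y  # correct with probability p
    # Otherwise pick uniformly among the other k - 1 labels.
    wrong = rng.randrange(k - 1)
    return wrong if wrong < y else wrong + 1


rng = random.Random(0)
k, p = 4, 0.7
draws = [sample_weak_label(2, p, k, rng) for _ in range(10_000)]
# The empirical accuracy should be close to p.
print(sum(d == 2 for d in draws) / len(draws))
```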

From Pairwise Agreement to Reliability

Under the model above, VLM responses from two different renderings $i$ and $j$ would agree either because both are correct or because both pick the same wrong label. Thus, the probability of agreement is:

$$\mathbf{A}_{ij}=\Pr[s_i=s_j]=p_i p_j+\frac{(1-p_i)(1-p_j)}{k-1}.$$ (1)

Since two random guesses could still agree by chance with probability $1/k$, we write $\delta_i=p_i-\tfrac{1}{k}$ and:

$$\mathbf{A}_{ij}=\frac{1}{k}+\frac{k}{k-1}\,\delta_i\delta_j \quad (i\neq j)$$ (2)

to separate chance from skill. Subtracting the chance term gives a centered agreement matrix $\mathbf{B}$ with $\mathbf{B}_{ij}=\mathbf{A}_{ij}-\tfrac{1}{k}$ for $i\neq j$ and $\mathbf{B}_{ii}=0$. Matrix $\mathbf{B}$ is rank one on the off-diagonals:

$$\mathbb{E}[\mathbf{B}]=\frac{k}{k-1}\,\boldsymbol{\delta}\boldsymbol{\delta}^\top,$$ (3)

which is the outer product of $\boldsymbol{\delta}$. Let $\lambda$ and $\mathbf{v}$ be the top eigenvalue and eigenvector of $\mathbf{B}$, then:

$$\boldsymbol{\delta}=\sqrt{\frac{\lambda(k-1)}{k}}\,\mathbf{v}, \qquad p_i=\frac{1}{k}+\delta_i,$$

with the sign of $\mathbf{v}$ chosen so that $\sum_i \delta_i\ge 0$, i.e., so that the rendering methods are on average better than chance. In this way, given the agreement matrix $\mathbf{A}$, we can recover the initially unknown reliability $p_i$ of each rendering method $i$.
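For completeness, the step from Equation (1) to Equation (2) can be checked by substituting $p_i=\tfrac{1}{k}+\delta_i$ and $1-p_i=\tfrac{k-1}{k}-\delta_i$ into each term:

$$\begin{aligned} p_i p_j &= \frac{1}{k^2}+\frac{\delta_i+\delta_j}{k}+\delta_i\delta_j,\\ \frac{(1-p_i)(1-p_j)}{k-1} &= \frac{k-1}{k^2}-\frac{\delta_i+\delta_j}{k}+\frac{\delta_i\delta_j}{k-1},\\ \mathbf{A}_{ij} &= \frac{1}{k}+\Bigl(1+\frac{1}{k-1}\Bigr)\delta_i\delta_j=\frac{1}{k}+\frac{k}{k-1}\,\delta_i\delta_j. \end{aligned}$$

As a numerical check with illustrative values $k=4$, $p_i=0.9$, and $p_j=0.7$: Equation (1) gives $0.63+\tfrac{0.1\cdot 0.3}{3}=0.64$, and Equation (2) gives $0.25+\tfrac{4}{3}\cdot 0.65\cdot 0.45=0.64$.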

The agreement matrix $\mathbf{A}$ can be empirically estimated by a burn-in pass, traversing the SVG primitives and collecting the agreement patterns:

$$\hat{\mathbf{A}}_{ij}=\frac{1}{|\mathcal{X}|}\sum_{\mathbf{x}\in\mathcal{X}}\mathbf{1}[s_i(\mathbf{x})=s_j(\mathbf{x})].$$

Following Equations (2) and (3), we can obtain $\hat{\boldsymbol{\delta}}$ and consequently a reliability $\hat{p}_i$ for each rendering method.
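The estimation procedure above can be sketched in a few lines of dependency-free Python. This is an illustrative sketch, not the released implementation: the function names, the use of shifted power iteration to obtain the top eigenpair, and the 0-based label indexing are assumptions.

```python
import math


def agreement_matrix(votes):
    """votes[n][i] is the label returned by rendering method i on primitive n."""
    n, m = len(votes), len(votes[0])
    return [[sum(row[i] == row[j] for row in votes) / n for j in range(m)]
            for i in range(m)]


def estimate_reliability(A, k, iters=500):
    """Estimate each p_i from the agreement matrix via the top eigenpair of B."""
    m = len(A)
    # Centered agreement: subtract the chance rate 1/k and zero the diagonal.
    B = [[0.0 if i == j else A[i][j] - 1.0 / k for j in range(m)]
         for i in range(m)]
    # Shift by a Gershgorin bound so every eigenvalue of B + shift*I is
    # non-negative; power iteration then targets the largest eigenvalue of B.
    shift = max(sum(abs(x) for x in row) for row in B)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(iters):
        w = [sum(B[i][j] * v[j] for j in range(m)) + shift * v[i]
             for i in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    lam = sum(v[i] * B[i][j] * v[j] for i in range(m) for j in range(m))
    if sum(v) < 0:  # fix the sign so that sum(delta) >= 0
        v = [-x for x in v]
    scale = math.sqrt(max(lam, 0.0) * (k - 1) / k)
    return [1.0 / k + scale * x for x in v]


# Exact agreement rates from Equation (1) for illustrative accuracies.
k = 4
p_true = [0.9, 0.8, 0.7, 0.3]
A = [[1.0 if i == j else
      p_true[i] * p_true[j] + (1 - p_true[i]) * (1 - p_true[j]) / (k - 1)
      for j in range(len(p_true))] for i in range(len(p_true))]
p_hat = estimate_reliability(A, k)
print([round(p, 3) for p in p_hat])
```

Because the diagonal of $\mathbf{B}$ is zeroed, the matrix is only approximately rank one, so in this sketch the estimates come out somewhat conservative. For example, five equally reliable methods with $p=0.7$ and $k=4$ are estimated at roughly 0.65.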

From Reliability to Semantic Labels

With reliabilities $\hat{p}_i$ in hand, we score each candidate label $y\in\mathcal{Y}$ for a given element using Bayes' decision rule with a uniform prior:

$$\log P(y\mid s)=\mathrm{const} +\sum_{i:\, s_i=y}\log \hat{p}_i +\sum_{i:\, s_i\ne y}\log\frac{1-\hat{p}_i}{k-1}.$$

This is equivalent to a weighted vote with:

$$w_i=\log\frac{(k-1)\hat{p}_i}{1-\hat{p}_i}, \qquad \hat{y}=\arg\max_y \sum_{i:\, s_i=y} w_i.$$

When all rendering methods are equally reliable, all $w_i$ are equal and the decision rule reduces to majority voting. The Appendix provides a probability bound that compares this rule with majority voting and shows a strict advantage whenever the reliabilities differ.
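The weighted vote can be sketched as follows, with a plain majority vote alongside for comparison. The function names, the clipping constant `eps`, and the example labels and reliabilities are illustrative assumptions.

```python
import math
from collections import Counter


def vote_weight(p, k, eps=1e-6):
    """Log-odds weight w = log((k - 1) p / (1 - p)), with p clipped away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log((k - 1) * p / (1.0 - p))


def bayes_vote(labels, reliabilities, categories):
    """Weighted vote over the labels returned by the M rendering methods."""
    k = len(categories)
    scores = {c: 0.0 for c in categories}  # unvoted categories score 0
    for label, p in zip(labels, reliabilities):
        scores[label] += vote_weight(p, k)
    return max(categories, key=lambda c: scores[c])


def majority_vote(labels):
    return Counter(labels).most_common(1)[0][0]


categories = ["ear", "nose", "eye"]
labels = ["ear", "ear", "nose", "nose", "nose"]  # one label per rendering method
p_hat = [0.9, 0.9, 0.4, 0.4, 0.4]                # hypothetical reliabilities
print(bayes_vote(labels, p_hat, categories), majority_vote(labels))
```

In this example the two reliable methods outweigh the three near-chance ones, so the weighted vote selects "ear" while majority voting selects "nose".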

From Semantic Labels to a New Structure

Once reliable semantic labels are available, restructuring the SVG becomes a straightforward step that turns meaning into organization without changing appearance. Although SVGs are usually grouped for rendering efficiency, not semantics, this step only needs to reorganize existing elements rather than reinterpret them. For example, shapes that share similar transformations may be grouped together even if they represent different objects, causing unrelated parts to move together. With correct labels, this can be easily fixed.

Our restructuring algorithm attaches each label as a class attribute and flattens the hierarchy so that all visual properties are applied directly to each primitive, preserving appearance. Primitives are then regrouped by label while maintaining the original paint order. Overlaps between different labels are checked to prevent rendering changes. The resulting SVG looks identical but is organized into meaningful parts ready for animation. Full pseudocode is provided in the Appendix, and the code will be released upon acceptance.
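A simplified sketch of the regrouping step is shown below. It is not the Appendix pseudocode: it assumes the primitives are already flattened, and it groups only consecutive runs of identically labeled primitives, which preserves paint order trivially and sidesteps the overlap check described above. The function name and input format are hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import groupby


def regroup_by_label(primitives, labels):
    """Wrap consecutive same-label primitives in <g class="label"> groups.

    primitives: list of (tag, attrs) pairs in paint order, already flattened.
    labels: one semantic label per primitive.
    """
    root = ET.Element("svg", {"xmlns": "http://www.w3.org/2000/svg"})
    pairs = zip(labels, primitives)
    for label, run in groupby(pairs, key=lambda pair: pair[0]):
        group = ET.SubElement(root, "g", {"class": label})
        for _, (tag, attrs) in run:
            # The label is also attached to each primitive as a class attribute.
            ET.SubElement(group, tag, {**attrs, "class": label})
    return root


primitives = [
    ("circle", {"cx": "50", "cy": "40", "r": "10", "fill": "gold"}),
    ("circle", {"cx": "50", "cy": "40", "r": "14", "fill": "none", "stroke": "gold"}),
    ("rect", {"y": "60", "width": "100", "height": "40", "fill": "green"}),
    ("circle", {"cx": "80", "cy": "20", "r": "3", "fill": "gold"}),
]
labels = ["sun", "sun", "ground", "sun"]

restructured = regroup_by_label(primitives, labels)
print(ET.tostring(restructured, encoding="unicode"))
```

Here the last primitive stays in its own "sun" group because merging it with the first group would move it beneath the "ground" rectangle in paint order.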

3.4 Animation Generation

The LLM is instructed to animate the restructured SVG file according to the animation plan using CSS. While the earlier pipeline steps do not restrict generating animations to the CSS markup type, CSS was chosen for its simplicity, and our method has the capability to extend to complex animations using JavaScript or specialized libraries.

Animation code can become lengthy, often exceeding the token generation limits of many models. To address this constraint, we adopt an iterative generation strategy, where CSS animations are generated separately for each semantic category, with previously completed animations retained in the context for subsequent generations. To prevent conflicting animations, we enforce strict animation rules that ensure mutual exclusivity between generated effects.
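The iterative strategy might be organized as in the sketch below. The `generate_css` callable is a hypothetical stand-in for the LLM call, and `fake_llm` is a placeholder used only to make the example runnable; the prompt format and the rules that enforce mutual exclusivity are not reproduced here.

```python
from typing import Callable, Dict, List


def generate_animation_css(
    categories: List[str],
    plan: str,
    generate_css: Callable[[str, str, str], str],
) -> str:
    """Generate CSS one semantic category at a time.

    `generate_css(category, plan, context)` is a hypothetical stand-in for
    the LLM call; `context` holds the CSS completed so far.
    """
    completed: Dict[str, str] = {}
    for category in categories:
        # Earlier results stay in the context for subsequent generations.
        context = "\n\n".join(completed.values())
        completed[category] = generate_css(category, plan, context)
    return "\n\n".join(completed.values())


def fake_llm(category: str, plan: str, context: str) -> str:
    # Placeholder generator: scopes its rule to the category's class selector.
    return f".{category} {{ animation: {category}-move 1s ease-in-out; }}"


css = generate_animation_css(["sun", "sky"], "make the sun rise", fake_llm)
print(css)
```

Scoping each rule to a single class selector, as the placeholder does, is one simple way to keep the generated effects from targeting the same elements.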

4. Experiments

4.1 Dataset

Our test dataset consists of 114 carefully curated animation instructions and SVG pairs, designed to test a variety of SVG animation techniques. The instruction set covers a broad range of animation tasks, from simple movements to complex actions such as 3D rotations and synchronized transitions. The SVG files were sourced from SVGRepo, ensuring a diverse collection of objects and scenes, including animals, logos, buildings, and natural elements like fire, clouds, and water. The goal of this dataset is to evaluate the performance of SVG animation tools and systems by providing clear, detailed animation instructions that simulate real-world use cases in web environments.

4.2 Baselines

AniClipart

AniClipart represents optimization-based animation methods, which optimize animation parameters, such as keypoint movements, using the Score Distillation Sampling loss. While AniClipart does not output standard animation formats, it defines Bézier curves for keypoints within SVG files, enabling direct vector graphics animation.

GPT-5

GPT-5 is reported to have one of the best understandings of symbolic representations among LLMs. However, we observe that naive prompting of LLMs to generate animation code rarely produces meaningful motion. Therefore, we augment GPT-5 with the same high-level planning and animation generation pipeline employed in our framework to ensure a fair comparison. In this configuration, GPT-5 generates CSS animations in vector format.

Video Generation Models

We include two video generation models, the open-source Wan2.2 14B model and OpenAI's Sora2 service. Although these models produce rasterized video output (.mp4) and cannot generate the desired vector files, we include them to cover a wide scope of animation generation techniques, especially as these models demonstrate strong performance in instruction following and video quality.

4.3 Implementation Details

We use GPT-5-nano, which is 25× more cost-efficient than GPT-5, as the underlying vision–language model for planning and semantic labeling, while GPT-5 is used for animation generation. Our semantic labeling stage is statistically robust to noise and operates with minimal computational overhead, enabling lightweight models to perform reliably without sacrificing accuracy. All SVG primitives are rendered at 512×512 resolution when given as a VLM input for analysis.

We do not share the agreement matrix across SVGs, since we find that the reliability of each rendering method can vary depending on the visual complexity and structure of the SVG. During the burn-in stage, where agreement patterns are collected, a single full pass over all primitives within each SVG provides a good balance between estimation stability and computational efficiency.

4.4 Quantitative Evaluation

We evaluate the generated animations using two instruction-following metrics and one perceptual quality metric. Following InternSVG, we measure the correspondence between animation instructions and rendered videos using a video-pretrained CLIP model, referred to as CLIP-T2V. To complement this, we introduce the GPT-T2V score, where GPT grades each video based on how accurately its motion follows the given instruction. Finally, we assess perceptual quality with DOVER, an off-the-shelf video quality assessment model that captures both technical fidelity and visual aesthetics.

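| Method | CLIP-T2V | GPT-T2V | DOVER | Vector |
|---|---|---|---|---|
| AniClipart | 15.66 | 23.96 | 3.35 | ✓ |
| GPT-5 | 20.67 | 40.92 | 4.92 | ✓ |
| Wan 2.2 | 21.14 | 65.21 | 3.72 | ✗ |
| Sora 2 | 20.29 | 69.08 | 4.19 | ✗ |
| Ours | 21.55 | 76.14 | 4.97 | ✓ |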

Table 1: Animation quality and instruction-following scores across different methods. A checkmark in the Vector column indicates that the method generates vector-based animations.

As shown in Table 1, our method achieves the best scores across all metrics, demonstrating clear advantages in both motion realism and instruction faithfulness. This improvement comes from the ability to expose meaningful parts of the SVG prior to animation, allowing the model to attach coherent motions to relevant semantic parts.

4.5 Qualitative Analysis

AniClipart and GPT-5 often fail to produce meaningful motion since they lack explicit semantic understanding. These approaches interpret semantics implicitly, AniClipart through the diffusion prior and GPT-5 through internal representations, without explicit part labels or hierarchy. As a result, they tend to produce uniform motion across entire figures, leading to swaying or barely moving animations.

Video generation models, Wan 2.2 and Sora 2, generate richer motion than the above methods but often collapse into static frames or distorted scenes when given dynamic, animation-focused instructions such as "An opening scene of the SVG." Note that these are rasterized videos rather than vector graphics, which makes them unsuitable for web-based animation tasks where lightweight rendering is essential. In contrast, our method translates instructions into motion entirely through the language domain, avoiding the limitations of multimodal training and dataset dependence.

Prompts (each shown with results from AniClipart, GPT-5, Wan 2.2, Sora 2, and Ours):
I want an opening animation for the SVG, starting from the bottom and moving up to the top.
I want the lightning bolt to glow softly and the raindrops to fade in and out gently.
I want the stars and planets first to emerge gently and then the rings to appear in a stroke effect.
I want the hexagon to appear first, and then the X sign to enter by spinning in.

Figure 4: Animations generated by each method.

4.6 User Study

To complement the quantitative evaluation, we conducted a user study to assess how well each generated animation aligns with the given instructions from a human perspective. A total of 760 pairwise comparisons were collected from 19 participants. In each trial, participants were shown two videos generated from the same instruction, each produced by a different method, and asked to select the one that better followed the instruction.

User Study Results
Figure 5: Human preference results comparing our method with baseline approaches. Pink segments represent preferences towards our method, and orange or purple segments represent the competing baseline.

5. Analysis

5.1 Encoding Efficiency of Vector-Based Animations

We demonstrate the efficiency of vector-based animations by comparing their compression ratio and animation fidelity against Sora 2. Typically, as raster video resolution increases for quality (e.g., from 480p to 720p), the file size grows and the video becomes less compressed. In contrast, the SVG animations generated by our approach describe motion through compact, symbolic CSS keyframes applied to geometric primitives. The resulting file size depends primarily on the complexity of the SVG structure (number of primitives) and the length of the animation code, not on the output resolution or frame rate.

Encoding Efficiency
Figure 6: Dual-axis bar chart comparing compression ratio (left y-axis) and animation fidelity (right y-axis). Compression ratios are depicted by solid bars, and Animation Fidelity is shown with hatched bars.

This leads to a significant improvement in encoding efficiency compared to video models like Wan 2.2 and Sora 2, which generate every pixel of the animation, even when a vector representation is possible. Sora 2, for instance, produces files that are on average 54× larger than those produced by our approach, with this gap widening as video resolution and duration increase. This makes our approach particularly well-suited for modern web environments, where lightweight assets are essential for fast loading times, responsive UI/UX, and reduced data consumption across networks.

5.2 Stability Compared to Majority Voting

Evaluating the quality of semantic groupings in SVGs is challenging without ground truth labels, yet crucial for understanding whether our statistical inference produces coherent clusters. We treat each semantic group as a cluster and measure clustering quality using the Davies-Bouldin index (DBI), a metric that quantifies the ratio of within-cluster scatter to between-cluster separation (lower is better). We compute distances in the feature space of DINO v3, which provides semantically meaningful visual embeddings.

SVG files with their original, rendering-oriented groupings yield an average DBI of 33.8, reflecting the semantic incoherence of primitives grouped solely for drawing efficiency. Majority voting with the same multi-view rendering techniques improves this to 12.6, demonstrating that aggregating multiple views helps, but still produces noisy groupings. In contrast, Vector Prism achieves a DBI of 0.82, indicating near-perfect semantic clustering.
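For reference, the Davies-Bouldin index can be computed as in the sketch below. The toy 2D points stand in for the visual embeddings, and the use of Euclidean distance and the function name are assumptions made for illustration.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index for labeled points (lower is better).

    Requires at least two clusters with distinct centroids.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    # Centroid and mean distance-to-centroid (scatter) for each cluster.
    centroids = {c: [sum(col) / len(pts) for col in zip(*pts)]
                 for c, pts in clusters.items()}
    scatter = {c: sum(math.dist(p, centroids[c]) for p in pts) / len(pts)
               for c, pts in clusters.items()}

    # For each cluster, take the worst similarity ratio against any other.
    names = list(clusters)
    total = 0.0
    for a in names:
        total += max(
            (scatter[a] + scatter[b]) / math.dist(centroids[a], centroids[b])
            for b in names if b != a
        )
    return total / len(names)


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = ["ear", "ear", "nose", "nose"]
dbi = davies_bouldin(points, labels)
print(dbi)
```

In this toy case each cluster has scatter 1 and the centroids are 10 apart, so the index is (1 + 1) / 10 = 0.2.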

DBI Analysis
Figure 7: Example case in which the Bayes decision rule makes consistently robust decisions even with noisy signals.

The key advantage of our approach over majority voting is illustrated in Figure 7. When one rendering method produces unreliable predictions that are correct only by chance, majority voting weights it equally with the reliable methods. This equal weighting allows the least reliable responses to occasionally flip the predicted label for certain primitives, creating inconsistent groupings that fragment semantically coherent parts. Since animation quality depends on all primitives being correctly grouped, even a small fraction of mislabeled elements can break the visual logic of motion. By estimating reliability scores for each rendering method, Vector Prism consistently downweights noisy VLM responses throughout the entire labeling process, ensuring stability across the full set of primitives.

5.3 Failure Cases

Even with semantic groupings and a well-structured hierarchy, our method operates at the level of SVG primitives defined in the original file. We treat primitives as atomic units and do not subdivide or decompose them further, which limits animation flexibility when the input SVG lacks granularity. For example, as shown in Figure 8, the lightning shape is written as a single large <path> element, while the instruction requires this to "shatter into pieces." The method fails to animate this part of the instruction, as the pieces themselves do not exist as independent primitives.

Failure Case
Figure 8: Failure case. Since the lightning bolt is defined as a single atomic <path> primitive (left), our approach cannot execute the operation beyond the input SVG's granularity (right).

This limitation could be addressed if users can refine their SVG files using vectorization tools such as VTracer or recent image-to-SVG models, which generate SVGs with controllable levels of detail. Alternatively, future work could explore automatic primitive subdivision strategies that identify and split overly coarse elements based on the animation requirements.

6. Conclusion

In this paper, we introduced Vector Prism, a novel framework designed to overcome the critical semantic-syntactic gap that prevents modern vision-language models (VLMs) from successfully animating Scalable Vector Graphics (SVGs). Our core insight is that by enriching the native SVG structure with coherent semantic anchors, VLMs can reason about meaningful parts and reliably generate targeted motion. The foundation of our approach is a multi-view statistical inference mechanism utilizing the Dawid-Skene model, which effectively transforms noisy, weak predictions from a VLM into robust, high-confidence semantic labels, eliminating the need for extensive, domain-specific VLM fine-tuning.

Through quantitative and qualitative evaluations, we demonstrated that our method improves both animation quality and instruction fidelity, surpassing existing vector animation techniques and state-of-the-art raster video generation models.

We believe that bridging the semantic-syntactic gap is a vital, generalizable step for unlocking the full potential of VLMs across various symbolic domains. Whether for vector graphics or for 3D assets and scenes, methods that align human semantic intent with machine-readable structure will significantly broaden the capabilities of language models, transforming them from passive code generators into robust, context-aware animation and design agents.
