Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces

ICCV 2025

Paper ID 6728

On this supplementary website, we provide additional video results for the following cases:

  • First row: videos showing reconstruction results under different settings.
  • Second row: videos showing text-to-video generation results with different latent spaces.
  • Third row: videos corresponding to figures in the main paper.
Reconstruction Comparison
4X (Baselines)
Reconstruction Comparison
8X
Reconstruction Comparison
16X
Reconstruction Comparison
Overlapping Chunks

Text-to-Video Generation
16X Latent
Text-to-Video Generation
4X vs. 16X Latent
Text-to-Video Generation
16X Latent Long Video
Text-to-Video Generation
Overlapping Chunks

Figure 3
Main Paper
Figure 2
Main Paper
Figure 5
Main Paper
Figure 9
Main Paper

Figure 5 (Main Paper)

[High-Resolution Reconstruction]

To fit the tokenizer in memory for high-resolution encoding and decoding, we need to tile the latent spatially. Tiling the latent before passing it through the decoder causes artifacts in the reconstructed video (middle). Layer-wise Spatial Tiling resolves these artifacts.

Ground-Truth

Spatial Tiling before Decoder

Layer-wise Spatial Tiling
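
The difference between the two tiling strategies can be illustrated with a toy example. The sketch below is not the paper's implementation. It assumes a stand-in "decoder" made of two zero-padded 3x3 convolutions, and it assumes that layer-wise tiling means that, at every layer, each tile is given a one-pixel halo of features from its neighbors. All function names are hypothetical.

```python
import numpy as np

def conv3x3_valid(x, k):
    """3x3 'valid' cross-correlation on a 2D array (no padding)."""
    h, w = x.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += k[i, j] * x[i:i + h - 2, j:j + w - 2]
    return out

def layer(x, k):
    """Toy decoder layer: zero-padded 3x3 conv followed by ReLU.
    (ReLU is a no-op for the positive toy values used below; it only
    stands in for a pointwise nonlinearity.)"""
    return np.maximum(conv3x3_valid(np.pad(x, 1), k), 0.0)

def decode_full(z, kernels):
    """Reference: run every layer on the whole, untiled feature map."""
    for k in kernels:
        z = layer(z, k)
    return z

def decode_tiled_before(z, kernels, tile):
    """Tiling before the decoder: split the latent into column tiles and
    decode each tile independently, so tiles never see their neighbors."""
    outs = [decode_full(z[:, s:s + tile], kernels)
            for s in range(0, z.shape[1], tile)]
    return np.concatenate(outs, axis=1)

def layer_tiled(x, k, tile):
    """One layer computed tile by tile, with a 1-pixel halo per tile
    (assumed form of layer-wise tiling)."""
    xp = np.pad(x, 1)  # zeros appear only at the true image borders
    outs = []
    for s in range(0, x.shape[1], tile):
        e = min(s + tile, x.shape[1])
        patch = xp[:, s:e + 2]  # original columns s-1 .. e
        outs.append(np.maximum(conv3x3_valid(patch, k), 0.0))
    return np.concatenate(outs, axis=1)

def decode_layerwise(z, kernels, tile):
    """Layer-wise tiling: tile each layer's input, stitch its output,
    then move on to the next layer."""
    for k in kernels:
        z = layer_tiled(z, k, tile)
    return z

rng = np.random.default_rng(0)
z = rng.random((8, 16)) + 0.1                      # toy positive "latent"
kernels = [rng.random((3, 3)) + 0.1 for _ in range(2)]

full = decode_full(z, kernels)
naive = decode_tiled_before(z, kernels, tile=8)
layerwise = decode_layerwise(z, kernels, tile=8)

# Per-column maximum error against the untiled reference.
print("tiling before decoder:", np.abs(full - naive).max(axis=0).round(3))
print("layer-wise tiling:    ", np.abs(full - layerwise).max(axis=0).round(3))
```

In this toy setting, decoding independent tiles differs from the untiled result only in the columns next to the tile seam, because each tile sees zero padding where its neighbor's features should be. The layer-wise variant reproduces the untiled output, since every layer reads the same neighboring values it would read without tiling.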