Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces

ICCV 2025

Paper ID 6728

In this supplementary website we provide additional video results for the following cases:

  • First row: videos showing reconstruction results under different settings.
  • Second row: videos showing text-to-video generation results with different latent spaces.
  • Third row: videos corresponding to figures in the main paper.
Reconstruction Comparison
  • 4× (Baselines)
  • 8×
  • 16×
  • Overlapping Chunks

Text-to-Video Generation
  • 16× Latent
  • 4× vs. 16× Latent
  • 16× Latent Long Video
  • Overlapping Chunks

Figures from the Main Paper
  • Figure 3
  • Figure 2
  • Figure 5
  • Figure 9

Figure 2 (Main Paper)

[Motivation of Progressive Growing]

We show that directly training MagViTv2 for 16× temporal compression leads to poor reconstruction quality on a 24fps video (middle); 𝑠𝑓=1 denotes a frame-subsampling factor of 1. However, the 4× temporal-compression MagViTv2 can still accurately reconstruct the same video once its frames are subsampled by a factor of 4 (𝑠𝑓=4), i.e., an effective 6fps clip (right). This observation implies that it is not necessarily large motion that degrades reconstruction, but that training many downsampling (upsampling) layers of the encoder (decoder) at once makes optimization difficult.
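The frame subsampling used for this comparison can be sketched as follows. This is a minimal illustration, not code from the paper: `subsample_frames` is a hypothetical helper, and the tensor layout (T, H, W, C) is an assumption.

```python
import numpy as np

def subsample_frames(video: np.ndarray, sf: int) -> np.ndarray:
    """Keep every sf-th frame of a (T, H, W, C) video tensor.

    With sf=4, a 1-second 24fps clip (24 frames) becomes an
    effective 6fps clip (6 frames), which a 4x temporal-compression
    tokenizer spans in the same number of latent frames as the
    16x model spans the original clip.
    """
    return video[::sf]

# Dummy 1-second clip at 24fps: sf=4 leaves 6 frames.
clip = np.zeros((24, 64, 64, 3), dtype=np.uint8)
print(subsample_frames(clip, 4).shape[0])  # → 6
```

Both inputs thus cover the same 1-second window and the same motion, isolating the effect of the compression factor itself.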

Ground-Truth

MagViTv2-16× (𝑠𝑓=1)

MagViTv2-4× (𝑠𝑓=4)