Progressive Growing of Video Tokenizers for Temporally Compact Latent Spaces

ICCV 2025

Paper ID 6728

In this supplementary website we provide additional video results for the following cases:

  • First row: videos showing reconstruction results under different settings.
  • Second row: videos showing text-to-video generation results with different latent spaces.
  • Third row: videos corresponding to figures in the main paper.

Reconstruction Comparison: 4X (Baselines) | 8X | 16X | Overlapping Chunks

Text-to-Video Generation: 16X Latent | 4X v/s 16X Latent | 16X Latent Long Video | Overlapping Chunks

Main Paper: Figure 3 | Figure 2 | Figure 5 | Figure 9

Reconstruction Comparison (4× Temporal Compression)

We compare the reconstruction results of several SOTA video tokenizers with our method, ProMAG, at 4× temporal compression. We find that MagViTv2 achieves competitive results compared to other methods with an 8-channel latent (zdim=8), so we build our model ProMAG on top of MagViTv2. Even after modifying MagViTv2 for efficiency and to enable progressive growing, ProMAG with zdim=8 achieves reconstruction quality similar to MagViTv2. Finally, ProMAG with a 16-channel latent (zdim=16) has reconstruction quality comparable to all SOTA video tokenizers with a 16-channel latent.
Note: The goal of our work is not to build the best 4× temporal compression video tokenizer, but to develop a method that, starting from a 4× video tokenizer, increases the temporal compression factor to 8× and eventually 16× while maintaining high reconstruction quality. The purpose of this comparison is to show that MagViTv2 and ProMAG are good 4× temporal compression video tokenizers compared to other SOTA video tokenizers, which is why we build our progressive growing method on top of them.
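
To make the compression factors above concrete, the following is a minimal sketch of how the latent size changes with the temporal compression factor and with zdim. It is illustrative arithmetic only and not the ProMAG implementation: the function names, the 8× spatial downsampling factor, the 16-frame 256×256 clip, and the plain ceiling division over frames (no special handling of a first frame) are all assumptions made for this example.

```python
import math


def latent_shape(num_frames, height, width, t_factor, zdim, s_factor=8):
    """Return (channels, latent_frames, latent_h, latent_w) for one clip.

    Assumptions (hypothetical, for illustration): s_factor=8 spatial
    downsampling, and frames are grouped by plain ceiling division.
    """
    latent_frames = math.ceil(num_frames / t_factor)
    return (zdim, latent_frames, height // s_factor, width // s_factor)


def compression_ratio(num_frames, height, width, t_factor, zdim, s_factor=8):
    """Ratio of RGB input values to latent values for one clip."""
    c, t, h, w = latent_shape(num_frames, height, width, t_factor, zdim, s_factor)
    input_values = 3 * num_frames * height * width
    latent_values = c * t * h * w
    return input_values / latent_values


if __name__ == "__main__":
    # Hypothetical 16-frame, 256x256 clip at the three temporal factors.
    for t_factor in (4, 8, 16):
        for zdim in (8, 16):
            shape = latent_shape(16, 256, 256, t_factor, zdim)
            ratio = compression_ratio(16, 256, 256, t_factor, zdim)
            print(f"{t_factor}x temporal, zdim={zdim}: latent {shape}, ratio {ratio:.0f}")
```

Under these assumptions, doubling the temporal compression factor halves the number of latent frames, while going from zdim=8 to zdim=16 doubles the latent size at the same temporal factor.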

Ground-Truth

MagViTv2 (zdim=8)

OmniTokenizer (zdim=8)

VidTok (zdim=8)

ProMAG (zdim=8)

Cosmos-CV (zdim=16)

CV-VAE (zdim=16)

CogVideoX (zdim=16)

WF-VAE (zdim=16)

ProMAG (zdim=16)

