We compare the reconstruction results of different SOTA video tokenizers with those of our method, ProMAG, at 4× temporal compression.
We find that MagViTv2 achieves competitive results compared to other methods with an 8-channel latent (zdim=8).
Thus, we build our model ProMAG on top of MagViTv2.
We show that even after modifying MagViTv2 for efficiency and to enable progressive growing, our model ProMAG with zdim=8 achieves reconstruction quality similar to that of MagViTv2.
Finally, ProMAG with a 16-channel latent (zdim=16) has reconstruction quality comparable to all SOTA video tokenizers with a 16-channel latent (zdim=16).
Note: The goal of our work is not to build the best 4× temporal compression video tokenizer, but to create a method that starts from a 4× video tokenizer and increases the temporal compression factor from 4× to 8× and eventually 16× while keeping reconstruction quality high.
The purpose of this comparison is to show that MagViTv2 and ProMAG are good 4× temporal compression video tokenizers compared to other SOTA video tokenizers, and thus we build our progressive growing method on top of them.
[Figure: ten reconstruction examples, each shown with the following panels: Ground-Truth, MagViTv2 (zdim=8), OmniTokenizer (zdim=8), VidTok (zdim=8), ProMAG (zdim=8), Cosmos-CV (zdim=16), CV-VAE (zdim=16), CogVideoX (zdim=16), WF-VAE (zdim=16), ProMAG (zdim=16).]