Directly training MagViTv2 for 16× temporal compression yields poor reconstruction quality on a 24 fps video (middle). Here 𝑠𝑓 denotes the frame subsampling factor, so 𝑠𝑓=1 means no subsampling. However, we observed that a MagViTv2 trained for 4× temporal compression can still accurately reconstruct the 6 fps video obtained by subsampling the same 24 fps video by a factor of 4 (𝑠𝑓=4, right). This observation suggests that the worse reconstruction is not necessarily caused by large motion, but rather by the difficulty of training many encoder downsampling layers (and the matching decoder upsampling layers) at once.
Panels, left to right: Ground-Truth; MagViTv2-16× (𝑠𝑓=1); MagViTv2-4× (𝑠𝑓=4).
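The comparison in this caption rests on a simple piece of arithmetic, which the following minimal sketch illustrates. It is not the authors' code: the helper names `subsample_frames` and `latent_span_seconds` are hypothetical, and the calculation assumes a uniform temporal stride, ignoring any special handling of the first frame by the tokenizer. Under those assumptions, one latent frame of the 16× model at 𝑠𝑓=1 and one latent frame of the 4× model at 𝑠𝑓=4 cover roughly the same duration of the source video, so the motion per latent frame is comparable in both settings.

```python
def subsample_frames(frames, sf):
    """Keep every sf-th frame; sf=1 keeps the video unchanged."""
    if sf < 1:
        raise ValueError("sf must be a positive integer")
    return frames[::sf]


def latent_span_seconds(fps, sf, temporal_compression):
    """Approximate seconds of source video covered by one latent frame.

    Assumes a uniform temporal stride (hypothetical simplification).
    """
    effective_fps = fps / sf  # e.g. 24 fps with sf=4 gives 6 fps
    return temporal_compression / effective_fps


# Frame indices stand in for actual frames: 2 seconds of 24 fps video.
frames_24fps = list(range(48))
frames_6fps = subsample_frames(frames_24fps, sf=4)

span_16x = latent_span_seconds(fps=24, sf=1, temporal_compression=16)
span_4x = latent_span_seconds(fps=24, sf=4, temporal_compression=4)
print(len(frames_6fps), span_16x, span_4x)
```

Because the two spans agree, the gap in reconstruction quality between the two models cannot be attributed to the amount of motion alone, which is the point the caption makes.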