PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding

Nie, Xuesong; Jin, Haoyuan; Yan, Yunfeng; Chen, Xi; Zhu, Zhihang; Qi, Donglian

Xuesong Nie, Haoyuan Jin, Yunfeng Yan, Xi Chen, Zhihang Zhu, Donglian Qi; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 18143-18152

Abstract

Predictive learning models, which aim to predict future frames based on past observations, are crucial to constructing world models. These models need to maintain low-level consistency and capture high-level dynamics in unannotated spatiotemporal data. Transitioning from frame-wise to token-wise prediction presents a viable strategy for addressing these needs. How to improve token representation and optimize token decoding presents significant challenges. This paper introduces PredToken, a novel predictive framework that addresses these issues by decoupling space-time tokens into distinct components for iterative cascaded decoding. Concretely, we first design a "decomposition, quantization, and reconstruction" schema based on VQGAN to improve the token representation. This scheme disentangles low- and high-frequency representations and employs a dimension-aware quantization model, allowing more low-level details to be preserved. Building on this, we present a "coarse-to-fine iterative decoding" method. It leverages dynamic soft decoding to refine coarse tokens and static soft decoding for fine tokens, enabling more high-level dynamics to be captured. These designs make PredToken produce high-quality predictions. Extensive experiments demonstrate the superiority of our method on various real-world spatiotemporal predictive benchmarks. Furthermore, PredToken can also be extended to other visual generative tasks to yield realistic outcomes.
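
The "decomposition, quantization, and reconstruction" idea can be illustrated with a toy sketch. The code below is not the paper's implementation: it assumes a 1D signal instead of video features, a moving-average filter as the low/high-frequency split, and small hand-written scalar codebooks in place of a learned VQGAN codebook. All function names (`box_blur`, `decompose`, `quantize`, `dequantize`, `reconstruct`) are hypothetical and chosen only for illustration.

```python
def box_blur(signal, radius=1):
    """Low-pass filter: moving average over a window truncated at the edges."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def decompose(signal, radius=1):
    """Split a signal into a low-frequency part and a high-frequency residual."""
    low = box_blur(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


def quantize(values, codebook):
    """Map each value to the index of its nearest codebook entry (a token)."""
    return [
        min(range(len(codebook)), key=lambda k: abs(codebook[k] - v))
        for v in values
    ]


def dequantize(indices, codebook):
    """Look up the codebook entry for each token index."""
    return [codebook[k] for k in indices]


def reconstruct(low_idx, high_idx, low_codebook, high_codebook):
    """Rebuild the signal by summing the dequantized low and high parts."""
    low = dequantize(low_idx, low_codebook)
    high = dequantize(high_idx, high_codebook)
    return [l + h for l, h in zip(low, high)]


if __name__ == "__main__":
    # Assumed toy data and codebooks, not values from the paper.
    signal = [0.0, 3.0, 0.0, 3.0, 0.0]
    low_codebook = [0.0, 0.5, 1.0, 1.5, 2.0]
    high_codebook = [-2.0, -1.0, 0.0, 1.0, 2.0]

    low, high = decompose(signal)
    low_tokens = quantize(low, low_codebook)     # coarse tokens
    high_tokens = quantize(high, high_codebook)  # fine tokens
    print(low_tokens, high_tokens)
    print(reconstruct(low_tokens, high_tokens, low_codebook, high_codebook))
```

In this sketch the two token streams use separate codebooks, which loosely mirrors the abstract's point that low- and high-frequency components are represented separately. How PredToken actually performs dimension-aware quantization and soft decoding is described in the paper itself and is not reproduced here.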

Related Material

[bibtex]

@InProceedings{Nie_2024_CVPR,
    author    = {Nie, Xuesong and Jin, Haoyuan and Yan, Yunfeng and Chen, Xi and Zhu, Zhihang and Qi, Donglian},
    title     = {PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {18143-18152}
}