An Empirical Study of Autoregressive Pre-training from Videos

Jathushan Rajasegaran, Ilija Radosavovic, Rahul Ravishankar, Yossi Gandelsman, Christoph Feichtenhofer, Jitendra Malik; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 19108-19118

Abstract


We empirically study autoregressive pre-training from videos. To perform our study, we construct a series of autoregressive video models, called Toto. We treat videos as sequences of visual tokens and train transformer models to autoregressively predict future tokens. Our models are pre-trained on a diverse dataset of videos and images comprising over 1 trillion visual tokens. We explore different architectural, training, and inference design choices. We evaluate the learned visual representations on a range of downstream tasks including image recognition, video classification, object tracking, and robotics. Our results demonstrate that, despite minimal inductive biases, autoregressive pre-training leads to competitive performance across all benchmarks. Finally, we find that scaling our video models results in scaling curves similar to those seen in language models, albeit at a different rate.
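
As an illustration of what "treating videos as sequences of visual tokens" and "autoregressively predicting future tokens" can mean in practice, the following is a minimal sketch of a next-token cross-entropy objective over a flattened token sequence. It is not the paper's implementation: the helper names `flatten_video` and `next_token_loss`, the frame-by-frame flattening order, and the toy vocabulary size are all assumptions made for illustration, and the logits here are placeholders for the output of a transformer.

```python
import math

def flatten_video(frames):
    """Concatenate per-frame token ids into one sequence, frame by frame.

    frames: list of frames, each a list of integer token ids
    (hypothetical layout; a real tokenizer defines its own ordering).
    """
    return [tok for frame in frames for tok in frame]

def next_token_loss(logits, tokens):
    """Average cross-entropy of predicting tokens[t + 1] from logits[t].

    logits: list of length T; logits[t] is a list of V unnormalized scores
            that a model would produce after seeing tokens[0..t]
    tokens: list of T integer token ids in [0, V)
    """
    total = 0.0
    count = 0
    for t in range(len(tokens) - 1):
        scores = logits[t]
        target = tokens[t + 1]
        # Numerically stable log-sum-exp for the softmax normalizer.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[target]
        count += 1
    return total / count

# Toy example: 2 frames of 2 tokens each, vocabulary of 3 tokens (assumed).
tokens = flatten_video([[0, 2], [1, 1]])
# Uniform logits stand in for an untrained model's predictions.
uniform_logits = [[0.0, 0.0, 0.0] for _ in tokens]
loss = next_token_loss(uniform_logits, tokens)
```

With uniform logits the loss equals the natural log of the vocabulary size, which is the usual starting point against which a trained model's loss is compared.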

Related Material


@InProceedings{Rajasegaran_2025_ICCV,
    author    = {Rajasegaran, Jathushan and Radosavovic, Ilija and Ravishankar, Rahul and Gandelsman, Yossi and Feichtenhofer, Christoph and Malik, Jitendra},
    title     = {An Empirical Study of Autoregressive Pre-training from Videos},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {19108-19118}
}