VideoWorld 2: Learning Transferable Knowledge from Real-world Videos

Ren, Zhongwei; Wei, Yunchao; Yu, Xiao; Luo, Guixun; Zhao, Yao; Kang, Bingyi; Feng, Jiashi; Jin, Xiaojie

Zhongwei Ren, Yunchao Wei, Xiao Yu, Guixun Luo, Yao Zhao, Bingyi Kang, Jiashi Feng, Xiaojie Jin; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 40569-40580

Abstract

Learning transferable knowledge from unlabeled video data and applying it in new environments is a fundamental capability of intelligent agents. This work presents VideoWorld 2, which extends VideoWorld and provides the first investigation of learning transferable knowledge for complex, long-horizon tasks directly from raw real-world videos. At its core, VideoWorld 2 introduces a dynamics-enhanced Latent Dynamics Model (dLDM) that decouples action dynamics from visual appearance: a pretrained video diffusion model handles visual appearance modeling, enabling the dLDM to learn latent codes that focus on compact and meaningful task-related dynamics. These latent codes are then modeled autoregressively to learn task policies and support long-horizon reasoning. We evaluate VideoWorld 2 on challenging real-world handcraft-making tasks, where prior video generation and latent-dynamics models struggle to operate reliably. Remarkably, VideoWorld 2 achieves up to 70% improvement in task success rate and produces coherent long execution videos. In robotics, we show that VideoWorld 2 can acquire transferable manipulation knowledge from the Open-X dataset, which substantially improves task performance on CALVIN, demonstrating strong cross-domain generalization. This study reveals the potential of learning transferable world knowledge directly from raw videos, with all code, data, and models open-sourced for further research.

Related Material

@InProceedings{Ren_2026_CVPR,
    author    = {Ren, Zhongwei and Wei, Yunchao and Yu, Xiao and Luo, Guixun and Zhao, Yao and Kang, Bingyi and Feng, Jiashi and Jin, Xiaojie},
    title     = {VideoWorld 2: Learning Transferable Knowledge from Real-world Videos},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {40569-40580}
}