WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Ao Liang, Lingdong Kong, Tianyi Yan, Hongsi Liu, Yu Yang, Ziqi Huang, Wei Yin, Jialong Zuo, Yixuan Hu, Dekai Zhu, Dongyue Lu, Youquan Liu, Guangfeng Jiang, Linfeng Li, Xiangtai Li, Long Zhuo, Lai Xing Ng, Benoit R. Cottereau, Changxin Gao, Liang Pan, Wei Tsang Ooi, Ziwei Liu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 36385-36399

Abstract

Generative world models are reshaping embodied AI, enabling agents to synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. Despite rapid progress, the field still lacks a unified way to assess whether generated worlds preserve geometry, obey physics, or support reliable control. We introduce **WorldLens**, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects - Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference - jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability. Across these dimensions, no existing world model excels universally: those with strong textures often violate physics, while geometry-stable ones lack behavioral fidelity. To align objective metrics with human judgment, we further construct **WorldLens-26K**, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop **WorldLens-Agent**, an evaluation model distilled from these annotations to enable scalable, explainable scoring. Together, the benchmark, dataset, and agent form a unified ecosystem for measuring world fidelity - standardizing how future models are judged not only by how real they look, but by how real they behave.

Related Material

@InProceedings{Liang_2026_CVPR,
    author    = {Liang, Ao and Kong, Lingdong and Yan, Tianyi and Liu, Hongsi and Yang, Yu and Huang, Ziqi and Yin, Wei and Zuo, Jialong and Hu, Yixuan and Zhu, Dekai and Lu, Dongyue and Liu, Youquan and Jiang, Guangfeng and Li, Linfeng and Li, Xiangtai and Zhuo, Long and Ng, Lai Xing and Cottereau, Benoit R. and Gao, Changxin and Pan, Liang and Ooi, Wei Tsang and Liu, Ziwei},
    title     = {WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {36385-36399}
}