Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation

Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, Ziwei Liu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 4495-4500

Abstract


Simulating fine-grained robot-world interactions remains a pivotal challenge in Embodied AI. While generative video models offer a flexible alternative to traditional simulators, existing approaches often struggle to maintain precise 4D spatiotemporal consistency during complex interactions. To address this, we present Kinema4D, an action-conditioned generative simulator designed to disentangle robotic control from environmental dynamics. Our framework represents robot kinematics as explicit 3D trajectories, which are projected into spatiotemporal pointmaps to steer a generative model. This allows for the synthesis of reactive environmental dynamics into synchronized RGB and pointmap sequences with high metric fidelity. Experiments across diverse real-world scenarios demonstrate that Kinema4D effectively generates physically plausible and geometry-consistent interactions.
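The projection step described above, from explicit 3D trajectories to spatiotemporal pointmaps, can be illustrated with a minimal sketch. The abstract does not specify the camera model, the point sampling, or the rasterization, so everything here is an assumption made for illustration: a pinhole camera with intrinsics `K`, robot points already expressed in the camera frame, nearest-pixel splatting with a z-buffer, and the hypothetical function name `project_trajectory_to_pointmaps`. It is not the authors' implementation.

```python
import numpy as np


def project_trajectory_to_pointmaps(traj_cam, K, height, width):
    """Rasterize per-frame 3D points into per-frame pointmaps.

    Assumptions (not taken from the paper): pinhole camera, points given
    in the camera frame, nearest-pixel splatting, closest point wins.

    traj_cam: (T, N, 3) array, N 3D points for each of T time steps.
    K:        (3, 3) pinhole intrinsics matrix.
    Returns:
        pointmaps: (T, H, W, 3) array storing the XYZ seen at each pixel.
        mask:      (T, H, W) boolean array, True where a point landed.
    """
    num_frames = traj_cam.shape[0]
    pointmaps = np.zeros((num_frames, height, width, 3), dtype=np.float64)
    mask = np.zeros((num_frames, height, width), dtype=bool)
    depth = np.full((num_frames, height, width), np.inf)

    for t in range(num_frames):
        pts = traj_cam[t]
        pts = pts[pts[:, 2] > 0]          # drop points behind the camera
        uvw = pts @ K.T                   # homogeneous pixel coordinates
        u = np.rint(uvw[:, 0] / uvw[:, 2]).astype(int)
        v = np.rint(uvw[:, 1] / uvw[:, 2]).astype(int)
        inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)

        # Z-buffer: keep the closest point when several hit one pixel.
        for ui, vi, p in zip(u[inside], v[inside], pts[inside]):
            if p[2] < depth[t, vi, ui]:
                depth[t, vi, ui] = p[2]
                pointmaps[t, vi, ui] = p
                mask[t, vi, ui] = True

    return pointmaps, mask
```

In this sketch each pixel stores the metric XYZ coordinate of the nearest robot point rather than a color, so stacking the frames yields a (T, H, W, 3) conditioning signal aligned with the video. How such a signal is actually encoded and fed to the generative model is not stated in the abstract.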

Related Material


@InProceedings{Xu_2026_CVPR,
    author    = {Xu, Mutian and Zhang, Tianbao and Liu, Tianqi and Chen, Zhaoxi and Han, Xiaoguang and Liu, Ziwei},
    title     = {Kinema4D: Kinematic 4D World Modeling for Spatiotemporal Embodied Simulation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {4495-4500}
}