Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World

Huang, Yuzhi; Wen, Kairun; Gao, Rongxin; Liu, Dongxuan; Lou, Yibin; Wu, Jie; Xu, Jing; Zhang, Jian; Yang, Zheng; Lin, Yunlong; Li, Chenxin; Pan, Panwang; Lu, Junbin; Jiang, Jingyan; Ding, Xinghao; Huang, Yue; Wang, Zhi

Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, Zhi Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 33446-33456

Abstract

Humans inhabit a physical 4D world, where spatial geometry and semantic content evolve over time, forming a dynamic reality. While current Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in understanding static visual inputs, it remains unclear whether they can effectively "think in dynamics," i.e., perceive, track, and reason about spatio-temporal evolution in complex scenes. To systematically evaluate these abilities, we introduce Dyn-Bench, a large-scale benchmark designed to assess spatio-temporal reasoning and localized dynamics perception. Constructed through multi-stage filtering over massive 2D and 4D data sources, Dyn-Bench provides a high-quality collection of diverse dynamic scenes, consisting of 1k videos, 7k visual question answering (VQA) pairs, and 3k dynamic object grounding samples. We comprehensively study general-purpose, spatial-aware, and region-level MLLMs to understand how they "think in dynamics" from both linguistic and visual perspectives. Our results reveal that existing models struggle to jointly excel in both spatio-temporal reasoning and dynamic object grounding, often producing inconsistent interpretations of motion and interaction. Conventional prompting strategies (i.e., chain-of-thought or caption-based hints) provide only limited improvements. In contrast, structured integration approaches, including Mask-Guided Fusion and the Spatio-Temporal Textual Cognitive Map (ST-TCM), substantially enhance MLLMs' dynamic perception and spatio-temporal reasoning in an evolving 4D world. These findings underscore the importance of explicit spatio-temporal structural cues to bridge the gap between static perception and dynamic reasoning in MLLMs.

Related Material



@InProceedings{Huang_2026_CVPR,
    author    = {Huang, Yuzhi and Wen, Kairun and Gao, Rongxin and Liu, Dongxuan and Lou, Yibin and Wu, Jie and Xu, Jing and Zhang, Jian and Yang, Zheng and Lin, Yunlong and Li, Chenxin and Pan, Panwang and Lu, Junbin and Jiang, Jingyan and Ding, Xinghao and Huang, Yue and Wang, Zhi},
    title     = {Thinking in Dynamics: How Multimodal Large Language Models Perceive, Track, and Reason Dynamics in Physical 4D World},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {33446-33456}
}