We introduce a new paradigm in which rigid-body dynamics are not just predicted as raw numbers but expressed as a structured description: a YAML configuration capturing object properties, initial states, and camera pose.
Language is inherently scalable, integrates naturally with vision-language models (VLMs), and makes it easy to incorporate reasoning.
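To make the idea of a structured description concrete, here is a minimal Python sketch of what such a configuration could look like. The schema is an assumption: every class and field name (`RigidBody`, `mass_kg`, `friction`, `look_at`, and so on) is hypothetical and chosen only for illustration, and the small `to_yaml` helper handles just the subset of YAML this example needs rather than the full format.

```python
from dataclasses import dataclass, asdict


@dataclass
class RigidBody:
    # Hypothetical fields: the actual schema is not specified in the text.
    name: str
    shape: str
    mass_kg: float
    friction: float
    position: list  # initial [x, y, z]
    velocity: list  # initial [vx, vy, vz]


@dataclass
class Camera:
    position: list
    look_at: list


@dataclass
class Scene:
    objects: list
    camera: Camera


def to_yaml(value, indent=0):
    """Emit a small YAML subset: nested dicts, lists of dicts, flat lists."""
    pad = "  " * indent
    lines = []
    for key, item in value.items():
        if isinstance(item, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(item, indent + 1))
        elif isinstance(item, list) and item and isinstance(item[0], dict):
            lines.append(f"{pad}{key}:")
            for entry in item:
                sub = to_yaml(entry, indent + 2)
                # Turn the first line of each entry into a "- " list item.
                sub[0] = f"{pad}  - {sub[0].lstrip()}"
                lines.extend(sub)
        elif isinstance(item, list):
            flat = ", ".join(str(x) for x in item)
            lines.append(f"{pad}{key}: [{flat}]")
        else:
            lines.append(f"{pad}{key}: {item}")
    return lines


scene = Scene(
    objects=[
        RigidBody("cube", "box", 1.0, 0.5, [0.0, 0.0, 0.5], [1.0, 0.0, 0.0]),
        RigidBody("cylinder", "cylinder", 2.0, 0.4, [1.5, 0.0, 0.5], [0.0, 0.0, 0.0]),
    ],
    camera=Camera(position=[0.0, -3.0, 1.5], look_at=[0.0, 0.0, 0.0]),
)

print("\n".join(to_yaml(asdict(scene))))
```

Because the output is plain text, a language model can emit it token by token and a physics simulator can parse it back, which is the property the structured description relies on.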
Motion Reasoning: Before predicting parameters, the model produces a natural-language event description (e.g., "the cube slides right and collides with the cylinder"). This intermediate reasoning step significantly improves physical accuracy.
Semantics-Agnostic Input: Instead of raw pixels, the model consumes optical flow, which captures motion while discarding textures, lighting, and background distractions.
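The sketch below illustrates why a flow-based input is insensitive to appearance. It uses exhaustive block matching, a deliberately simple stand-in: the text does not say which flow estimator is used, and the helper names (`make_frame`, `shift_right`, `block_flow`) are hypothetical. Two patches with completely different textures undergo the same motion, and the estimated displacement is identical for both.

```python
def make_frame(texture, height=7, width=9, top=2, left=1):
    """Place a small textured patch on an otherwise black frame."""
    frame = [[0] * width for _ in range(height)]
    for j, row in enumerate(texture):
        for i, value in enumerate(row):
            frame[top + j][left + i] = value
    return frame


def shift_right(image, pixels):
    """Return a copy of the frame moved right by `pixels`, zero-filled."""
    width = len(image[0])
    return [
        [row[x - pixels] if 0 <= x - pixels < width else 0 for x in range(width)]
        for row in image
    ]


def block_flow(prev, curr, y, x, size=3, search=2):
    """Estimate (dx, dy) for the block at (y, x) by exhaustive matching."""
    height, width = len(prev), len(prev[0])
    best, best_cost = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip candidate blocks that would fall outside the frame.
            if y + dy < 0 or y + dy + size > height:
                continue
            if x + dx < 0 or x + dx + size > width:
                continue
            # Sum of absolute differences between the two blocks.
            cost = sum(
                abs(prev[y + j][x + i] - curr[y + dy + j][x + dx + i])
                for j in range(size)
                for i in range(size)
            )
            if cost < best_cost:
                best, best_cost = (dx, dy), cost
    return best


# Two patches with unrelated appearance but identical motion.
texture_a = [[10, 20, 30], [40, 50, 60], [70, 80, 90]]
texture_b = [[200, 5, 120], [33, 250, 7], [90, 14, 180]]

frame_a = make_frame(texture_a)
frame_b = make_frame(texture_b)

flow_a = block_flow(frame_a, shift_right(frame_a, 2), y=2, x=1)
flow_b = block_flow(frame_b, shift_right(frame_b, 2), y=2, x=1)

print(flow_a, flow_b)  # (2, 0) (2, 0)
```

The pixel values of the two inputs share nothing, yet the flow vectors match, so a model fed only the flow sees the same signal for both scenes.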