∆YNAMICS: Language-Based Representation for Inferring Rigid-Body Dynamics From Videos

Anonymous Authors

Abstract

1. Can Motion Be Represented as Language?

We introduce a new paradigm in which rigid-body dynamics are not just predicted as numbers but expressed as structured text: a YAML configuration capturing object properties, initial states, and camera pose.
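To make the idea concrete, the sketch below builds one possible scene description and prints it as YAML. Every key, value, and unit here (`objects`, `mass`, `friction`, `initial_state`, `camera`, and so on) is an illustrative assumption, not the schema used by the authors, and `to_yaml` is a small hypothetical helper that handles only the nesting used in this example.

```python
# Hypothetical scene description. All field names and numbers are assumptions
# made for illustration; the authors' actual schema is not given in the text.
scene = {
    "objects": [
        {
            "name": "cube",
            "shape": "box",
            "mass": 1.0,
            "friction": 0.5,
            "initial_state": {
                "position": [0.0, 0.0, 0.5],
                "linear_velocity": [1.2, 0.0, 0.0],
            },
        },
        {
            "name": "cylinder",
            "shape": "cylinder",
            "mass": 2.0,
            "friction": 0.5,
            "initial_state": {
                "position": [1.5, 0.0, 0.5],
                "linear_velocity": [0.0, 0.0, 0.0],
            },
        },
    ],
    "camera": {
        "position": [3.0, -3.0, 2.0],
        "look_at": [0.0, 0.0, 0.0],
    },
}


def to_yaml(value, indent=0):
    """Serialize nested dicts into lines of a small YAML subset.

    Lists of dicts become block sequences; lists of scalars are written
    in flow style, e.g. [0.0, 0.0, 0.5].
    """
    pad = "  " * indent
    lines = []
    for key, item in value.items():
        if isinstance(item, dict):
            lines.append(f"{pad}{key}:")
            lines.extend(to_yaml(item, indent + 1))
        elif isinstance(item, list) and item and isinstance(item[0], dict):
            lines.append(f"{pad}{key}:")
            for entry in item:
                entry_lines = to_yaml(entry, indent + 2)
                # The first key of each entry carries the "- " list marker.
                lines.append(f"{pad}  - {entry_lines[0].lstrip()}")
                lines.extend(entry_lines[1:])
        else:
            lines.append(f"{pad}{key}: {item}")
    return lines


yaml_lines = to_yaml(scene)
yaml_text = "\n".join(yaml_lines)
print(yaml_text)
```

Because the result is plain text, it can be emitted token by token by a language model and then handed to a physics simulator that reads the same fields.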

2. Why a Text-Based Representation?

Language is inherently scalable, integrates naturally with vision-language models (VLMs), and makes it easy to include reasoning.

3. Secret Ingredients for ΔYNAMICS

Motion Reasoning: Before predicting parameters, the model produces a natural-language event description (e.g., "the cube slides right and collides with the cylinder"). This intermediate reasoning step significantly improves physical accuracy.

Semantics-Agnostic Input: Instead of raw pixels, the model consumes optical flow—focusing purely on motion while ignoring textures, lighting, and background distractions.
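The Motion Reasoning ingredient implies an output in which the event description comes first and the parameters follow. A minimal sketch of such a two-part text target is shown below. The `---` marker and the function names `build_target` and `split_output` are assumptions introduced for illustration, not the authors' actual format.

```python
# Hypothetical delimiter between the reasoning text and the YAML payload.
YAML_MARKER = "---"


def build_target(event_description: str, yaml_config: str) -> str:
    """Place the event description before the parameters."""
    return f"{event_description.strip()}\n{YAML_MARKER}\n{yaml_config.strip()}\n"


def split_output(model_output: str) -> tuple[str, str]:
    """Recover (event description, YAML text) from a generated string."""
    reasoning, sep, config = model_output.partition(f"\n{YAML_MARKER}\n")
    if not sep:
        raise ValueError("no YAML marker found in model output")
    return reasoning.strip(), config.strip()


event = "The cube slides right and collides with the cylinder."
config = "objects:\n  - name: cube"  # placeholder YAML, not a real prediction

target = build_target(event, config)
recovered_event, recovered_config = split_output(target)
print(target)
```

Putting the description first means that, in an autoregressive model, every parameter token is generated with the event description already in context.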

Results