MoReGen: Multi-Agent Motion-Reasoning Engine for Code-based Text-to-Video Synthesis

Authors: Xiangyu Bai, He Liang, Bishoy Galoaa, Utsav Nandi, Shayda Moezzi, Yuhang He, Sarah Ostadabbas

Abstract

While text-to-video (T2V) generation has achieved remarkable progress in photorealism, generating intent-aligned videos that faithfully obey physics principles remains a core challenge. In this work, we systematically study Newtonian motion-controlled text-to-video generation and evaluation, emphasizing physical precision and motion coherence. We introduce MoReGen, a motion-aware, physics-grounded T2V framework that integrates multi-agent LLMs, physics simulators, and renderers to generate reproducible, physically accurate videos from text prompts in the code domain. To quantitatively assess physical validity, we propose object-trajectory correspondence as a direct evaluation metric and present MoReSet, a benchmark of 1,275 human-annotated videos spanning nine classes of Newtonian phenomena with scene descriptions, spatiotemporal relations, and ground-truth trajectories. Using MoReSet, we conduct experiments on existing T2V models, evaluating their physical validity through both our MoRe metrics and existing physics-based evaluators. Our results reveal that state-of-the-art models struggle to maintain physical validity, while MoReGen establishes a principled direction toward physically coherent video synthesis.

Data-driven Models do not Understand Physics Principles

We show that even the most advanced text-to-video models lack physical compliance and reasoning (left: Grok Imagine by xAI; middle: Sora 2 by OpenAI; right: Veo 3 by Google):

Prompt: Generate a video that showcase the following scene: Five shiny metal balls of a newton's cradle is visible, along with parts of a single vertical string for each metal ball respectively. These strings keeps their respective metal ball suspended. The top part of the newton's cradle is not visible. The camera faces all the five metal balls. The first and leftmost ball is at an angle of 30 degrees from the cradle and released. Due to gravity, the ball comes and strikes the second ball from the left. This causes momentum to be transferred to the fifth and the right most ball which is launched at a slightly lesser angle, having lost some momentum. This process keeps repeating itself till the rightmost ball has lost a lot of momentum when the video ends.

Overall Pipeline

The overall pipeline of the proposed model. We highlight the following novel components of the MoReGen pipeline:

  1. The text-parser agent parses the raw description into a structured Newtonian specification containing all required objects, parameters, and initial conditions. This component is fine-tuned on our dataset.
  2. The code-writer agent converts the specification into executable physics simulation code, which is run inside a sandboxed environment to obtain object configurations and trajectories. The video-renderer agent then produces a rendering script that consumes these trajectories to generate the video.
  3. The evaluator analyzes the output videos using grounded detectors, trackers, and vision-language models (VLMs) to provide feedback to the other agents, guiding a multi-iteration refinement process.
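The steps above can be illustrated with a minimal sketch. It is not the MoReGen implementation: `PendulumSpec`, `simulate`, `mean_trajectory_error`, and `evaluate` are hypothetical names, the integrator (semi-implicit Euler) and the 1 cm acceptance tolerance are assumptions chosen for brevity, and the pendulum parameters are borrowed from the pendulum prompt further down this page. The sketch stands in for a structured specification, the kind of simulation code a code-writer agent might emit, and a trajectory-based check that an evaluator could feed back to the other agents.

```python
import math
from dataclasses import dataclass


@dataclass
class PendulumSpec:
    """Hypothetical structured specification for a simple pendulum scene."""
    length_m: float
    initial_angle_deg: float
    gravity: float = 9.8      # m/s^2
    duration_s: float = 4.0
    dt: float = 0.001         # integration step in seconds


def simulate(spec):
    """Integrate theta'' = -(g/L) sin(theta) with semi-implicit Euler.

    Returns a list of (t, x, y) bob positions relative to the pivot.
    """
    theta = math.radians(spec.initial_angle_deg)
    omega = 0.0  # released from rest
    steps = int(round(spec.duration_s / spec.dt))
    trajectory = []
    for i in range(steps + 1):
        trajectory.append((i * spec.dt,
                           spec.length_m * math.sin(theta),
                           -spec.length_m * math.cos(theta)))
        # Update velocity first, then position (semi-implicit Euler).
        omega -= (spec.gravity / spec.length_m) * math.sin(theta) * spec.dt
        theta += omega * spec.dt
    return trajectory


def mean_trajectory_error(predicted, reference):
    """Mean Euclidean distance between time-aligned (t, x, y) samples."""
    pairs = list(zip(predicted, reference))
    return sum(math.hypot(px - rx, py - ry)
               for (_, px, py), (_, rx, ry) in pairs) / len(pairs)


def evaluate(predicted, reference, tolerance_m=0.01):
    """Toy evaluator: accept when the path stays close to the reference."""
    error = mean_trajectory_error(predicted, reference)
    return {"accepted": error <= tolerance_m, "error_m": error}


spec = PendulumSpec(length_m=0.68, initial_angle_deg=20.0)
reference = simulate(spec)
# A deliberately wrong candidate (string 10 cm too long) should be rejected.
candidate = simulate(PendulumSpec(length_m=0.78, initial_angle_deg=20.0))
print(evaluate(reference, reference)["accepted"],
      evaluate(candidate, reference)["accepted"])
```

In a real refinement loop, the rejection and its error value would be returned to the upstream agents as feedback. Here the trajectories are compared directly, whereas the evaluator described above recovers them from rendered video with detectors and trackers.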

MoReSet: Our Dataset Focusing on Newtonian Physics

We introduce MoReSet, a novel open-source benchmark for evaluating T2V models through the lens of physics fidelity. The dataset comprises 1,275 videos spanning nine fundamental categories of Newtonian physics experiments: gravity, acceleration, collision, oscillation, momentum, buoyancy, inertia, pendular motion, and pulley mechanics.

Qualitative Comparison Using Newton's Cradle

Prompt: Generate a video that showcase the following scene: Five shiny metal balls of a newton's cradle is visible, along with parts of a single vertical string for each metal ball respectively. These strings keeps their respective metal ball suspended. The top part of the newton's cradle is not visible. The camera faces all the five metal balls. The first and leftmost ball is at an angle of 30 degrees from the cradle and released. Due to gravity, the ball comes and strikes the second ball from the left. This causes momentum to be transferred to the fifth and the right most ball which is launched at a slightly lesser angle, having lost some momentum. This process keeps repeating itself till the rightmost ball has lost a lot of momentum when the video ends.

More Results from MoReGen

Future Steps

For enhanced realism, we explored conditioned video generation using paths derived from our simulation outputs. For this experiment, we used conditioning-to-video models, which take a control signal such as a motion mask, a trajectory, or an edge map as input. We show that, without fine-tuning and domain adaptation, it is very difficult to synthesize realistic and physically accurate videos directly from video-to-video models.

Wan2.2-Fun (Alibaba 2025)

Motion mask to video. The motion mask (left column) is generated from synthesized trajectories.
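One plausible way to turn a synthesized trajectory into per-frame motion masks is sketched below. This is an illustration only: the function names `disk_mask` and `trajectory_to_masks`, the disk-shaped object footprint, and the frame size are assumptions, and the mask format that Wan2.2-Fun actually expects is not specified on this page.

```python
def disk_mask(width, height, cx, cy, radius):
    """Binary mask (rows of 0/1) with a filled disk centred at (cx, cy)."""
    return [
        [1 if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 else 0
         for x in range(width)]
        for y in range(height)
    ]


def trajectory_to_masks(trajectory_px, width, height, radius):
    """One mask per trajectory sample; trajectory_px holds (x, y) pixel centres."""
    return [disk_mask(width, height, cx, cy, radius)
            for cx, cy in trajectory_px]


# Example: a ball moving left to right across a 32x16 frame.
trajectory_px = [(4 + 6 * i, 8) for i in range(5)]
masks = trajectory_to_masks(trajectory_px, width=32, height=16, radius=3)
```

Each element of `masks` is one frame. Stacking the frames in order gives a mask video that follows the simulated object path.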

Motion Mask — Newton's Cradle
Generated Video. Prompt: Generate a video that showcase the following scene: Five shiny metal balls of a newton's cradle is visible, along with parts of a single vertical string for each metal ball respectively. These strings keeps their respective metal ball suspended. The top part of the newton's cradle is not visible. The camera faces all the five metal balls. The first two balls from the left: balls 1 and 2 is at an angle of 30 degrees from the cradle and released simultaneously. Due to gravity, both the balls comes and strikes the third ball from the left. This causes momentum to be transferred to the fourth and fifth ball and they are launched at a slightly lesser angle towards the right, having lost some momentum. This process keeps repeating itself till the two of the rightmost ball have lost a lot of momentum when the video ends.
Motion Mask — Spring
Generated Video. Prompt: Generate a video that showcase the following scene: A mass-spring system can be seen. The spring has a stiffness of 20g/cm and the weight attached to this spring is 200g. Initially the spring is at a length of 5cm and a platform supporting the weight is removed. This causes the spring to oscillate. This oscillation continues for a significant amount of time. The maximum length of the spring is 25cm. The video stops after the oscillation has occoured for several times.
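As a sanity check, the numbers in the spring prompt above are mutually consistent under two assumptions that the prompt does not state: the stiffness of 20 g/cm is read as 20 gram-force per centimetre, and the spring is at its natural length of 5 cm while the platform supports the weight. Taking g = 9.8 m/s² (also an assumption here), the equilibrium extension, maximum length, and period of an ideal undamped spring are:

```latex
\begin{align*}
x_{\mathrm{eq}} &= \frac{W}{k} = \frac{200\ \mathrm{gf}}{20\ \mathrm{gf/cm}} = 10\ \mathrm{cm},\\
\ell_{\max} &= \ell_0 + 2x_{\mathrm{eq}} = 5\ \mathrm{cm} + 20\ \mathrm{cm} = 25\ \mathrm{cm},\\
T &= 2\pi\sqrt{\frac{m}{k}} = 2\pi\sqrt{\frac{x_{\mathrm{eq}}}{g}}
   \approx 2\pi\sqrt{\frac{0.10\ \mathrm{m}}{9.8\ \mathrm{m/s^2}}} \approx 0.63\ \mathrm{s}.
\end{align*}
```

The weight released from rest at the natural length oscillates symmetrically about the equilibrium length of 15 cm, which reproduces the 25 cm maximum length given in the prompt.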

Motion-I2V (SIGGRAPH 2024)

Motion trajectory to video. The trajectory conditioning (left column) is generated from synthesized trajectories, and the initial image is sampled from MoReSet.

Motion Mask — Cradle
Generated GIF — Newton's Cradle
Generated Video (GIF). Prompt: Generate a video that showcase the following scene: Five shiny metal balls of a newton's cradle is visible, along with parts of a single vertical string for each metal ball respectively. These strings keeps their respective metal ball suspended. The top part of the newton's cradle is not visible. The camera faces all the five metal balls. The first two balls from the left: balls 1 and 2 is at an angle of 30 degrees from the cradle and released simultaneously. Due to gravity, both the balls comes and strikes the third ball from the left. This causes momentum to be transferred to the fourth and fifth ball and they are launched at a slightly lesser angle towards the right, having lost some momentum. This process keeps repeating itself till the two of the rightmost ball have lost a lot of momentum when the video ends.
Motion Mask — Spring
Generated GIF — Spring
Generated Video (GIF). Prompt: Generate a video that showcase the following scene: A mass-spring system can be seen. The spring has a stiffness of 20g/cm and the weight attached to this spring is 200g. Initially the spring is at a length of 5cm and a platform supporting the weight is removed. This causes the spring to oscillate. This oscillation continues for a significant amount of time. The maximum length of the spring is 25cm. The video stops after the oscillation has occoured for several times.

Cosmos Transfer 2.5 (NVIDIA 2025)

Edge map to video. The edge map (left column) is generated from synthesized videos.

Edge Mask — pendulum
Generated Video. Prompt: Generate a video that showcase the following scene: A simple pendulum is formed by a 226 g steel bob attached to a lightweight, inextensible string 68 cm in length, hanging from a fixed point 86 cm above the ground. The bob is initially pulled to the right, creating a 20-degree angle from the vertical, and released from rest. It then swings in a circular arc, exhibiting periodic motion driven by gravitational acceleration (g = 9.8 m/s²).
Edge Mask — Dominos
Generated Video. Prompt: Generate a video that showcase this scene: Ten wooden planks, each 0.5cm thin, placed at a distance of 5cm apart. The rightmost one is pushed from the top. This causes it to topple over and hit the second one from the right. This causes a chain reaction and all the wooden planks fall and topple the one towards its left. The video ends when the all the ten planks are toppled over.