Teaching video models to respect physics through planning

Yuxiang Feng, Juncheng Wang, Chao Xu, Yijie Qian, Huihan Wang, Wenlong Hou, Yang Liu, Baigui Sun, Yong Liu, Shujun Wang

Video generation models produce visually plausible footage but systematically violate physical laws: on the VideoPhy-2 benchmark, state-of-the-art models achieve only 32.6% accuracy at preserving realistic dynamics. The core problem is that a text prompt cannot specify all the parameters needed to determine a physical outcome. NEWTON reframes video generation as one tool within an agentic system: a learned planner coordinates keyframe generation, physics simulation, and prompt refinement to build rich conditioning, then iteratively refines the output using a verifier. Only the planner is trained, optimized on-policy via Flow-GRPO. On VideoPhy-2, NEWTON raises joint accuracy from 21.4% to 29.7% for LTX-Video and from 30.7% to 37.4% for Veo-3.1, without modifying either generator. Code and project details are available.
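The plan, generate, verify loop described in the abstract can be illustrated with a minimal Python sketch. It is not the NEWTON implementation: every name here (`Conditioning`, `toy_planner`, `toy_generator`, `toy_verifier`, `run_agentic_loop`) is hypothetical, the tools are string-manipulating stubs, the planner is a hand-written rule rather than a learned policy, and Flow-GRPO training is not shown. The sketch only shows the control flow, in which the generator stays fixed and all improvement comes from richer conditioning.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Conditioning:
    """Everything handed to the (unmodified) video generator."""
    prompt: str
    keyframes: Tuple[str, ...] = ()
    trajectories: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Result:
    video: str
    score: int
    rounds: int  # round in which this video was produced


# Hypothetical tools: each returns an enriched copy of the conditioning.
def simulate_physics(cond: Conditioning, feedback: str) -> Conditioning:
    return replace(cond, trajectories=cond.trajectories + (f"sim({cond.prompt})",))


def generate_keyframes(cond: Conditioning, feedback: str) -> Conditioning:
    return replace(cond, keyframes=cond.keyframes + ("frame_0", "frame_T"))


def refine_prompt(cond: Conditioning, feedback: str) -> Conditioning:
    return replace(cond, prompt=f"{cond.prompt} [note: {feedback}]")


TOOLS: Dict[str, Callable[[Conditioning, str], Conditioning]] = {
    "simulate_physics": simulate_physics,
    "generate_keyframes": generate_keyframes,
    "refine_prompt": refine_prompt,
}


def toy_planner(cond: Conditioning, feedback: str) -> List[str]:
    """Rule-based stand-in for the learned planner: choose which tools to call."""
    plan = []
    if not cond.trajectories:
        plan.append("simulate_physics")
    if not cond.keyframes:
        plan.append("generate_keyframes")
    if feedback:
        plan.append("refine_prompt")
    return plan


def toy_generator(cond: Conditioning) -> str:
    """Stand-in for a frozen video model; returns a description, not pixels."""
    return (f"video<{cond.prompt} | {len(cond.keyframes)} keyframes | "
            f"{len(cond.trajectories)} trajectories>")


def toy_verifier(video: str, cond: Conditioning) -> Tuple[int, str]:
    """Toy shortcut: scores the conditioning (0 to 3) instead of the video."""
    checks = [bool(cond.keyframes), bool(cond.trajectories), "[note:" in cond.prompt]
    score = sum(checks)
    feedback = "" if score == 3 else "state the expected motion explicitly"
    return score, feedback


def run_agentic_loop(prompt: str, max_rounds: int = 3, target: int = 3) -> Result:
    cond, feedback = Conditioning(prompt), ""
    best = Result(video="", score=-1, rounds=0)
    for round_idx in range(1, max_rounds + 1):
        for name in toy_planner(cond, feedback):      # plan
            cond = TOOLS[name](cond, feedback)        # build conditioning
        video = toy_generator(cond)                   # generate
        score, feedback = toy_verifier(video, cond)   # verify
        if score > best.score:
            best = Result(video, score, round_idx)
        if score >= target:
            break
    return best


if __name__ == "__main__":
    print(run_agentic_loop("a ball rolls off a table"))
```

In this toy setup the first round adds simulated trajectories and keyframes, and the verifier's feedback triggers prompt refinement in the second round. In the system the abstract describes, the planner's choices would instead come from a policy trained on-policy, presumably with the verifier's signal as reward.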