Can video generators learn physics from their own training data?

Video diffusion models generate visually smooth clips but fail at physics: objects move implausibly, violate conservation of momentum, or collide unrealistically. LaMo learns latent motion patterns from the same unlabeled videos used to train the generator and uses them as a self-supervised signal. It adds two plug-and-play components: a motion drift loss applied during training and motion guidance applied during sampling. On the physics benchmarks VideoPhy and VideoPhy2, it outperforms physics-aware baselines that rely on external supervision or teacher models, while maintaining visual quality on general benchmarks.
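To make the idea of sampling-time guidance concrete, here is a minimal toy sketch in Python. It is not LaMo's method: the summary does not specify how the latent motion patterns or the guidance term are defined, so the sketch swaps in a hand-written "motion energy" (squared frame-to-frame acceleration) as a stand-in for a learned motion signal. The names `motion_energy`, `motion_energy_grad`, `guided_step`, `denoise_fn`, and `guidance_scale` are all hypothetical.

```python
import numpy as np

def motion_energy(frames):
    """Toy stand-in for a motion score: sum of squared second temporal
    differences (acceleration) over a (T, D) array of frame latents.
    ASSUMPTION: a learned latent motion model would replace this."""
    accel = frames[2:] - 2.0 * frames[1:-1] + frames[:-2]
    return float(np.sum(accel ** 2))

def motion_energy_grad(frames):
    """Analytic gradient of motion_energy with respect to frames."""
    accel = frames[2:] - 2.0 * frames[1:-1] + frames[:-2]
    grad = np.zeros_like(frames)
    grad[2:] += 2.0 * accel      # contribution through frame t+2
    grad[1:-1] += -4.0 * accel   # contribution through frame t+1
    grad[:-2] += 2.0 * accel     # contribution through frame t
    return grad

def guided_step(frames, denoise_fn, guidance_scale=0.02):
    """One hypothetical sampling step: apply the sampler's own update,
    then nudge the result down the gradient of the motion energy."""
    updated = denoise_fn(frames)
    return updated - guidance_scale * motion_energy_grad(updated)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    latents = rng.normal(size=(8, 4))  # 8 frames, 4-dim latent each
    # Identity function stands in for a real denoiser update.
    guided = guided_step(latents, denoise_fn=lambda x: x)
    print(motion_energy(latents), motion_energy(guided))
```

The guidance term is kept separate from the denoiser call, which mirrors the plug-and-play framing: the base sampler is untouched and the motion signal only adds a correction after each step.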