Making video generators respect physics and geometry

Jan Ackermann, Shengqu Cai, Boyang Deng, Zhengfei Kuang, Songyou Peng, Gordon Wetzstein

Text-to-video diffusion models generate plausible motion but often produce geometric artifacts such as deformed objects, texture drift, and unrealistic background movement. GeoFlow introduces a geometry-consistency reward that judges whether a generated video respects two physical principles: the background should move only because of camera motion (rigid flow), and moving objects should keep their visual identity. The method combines optical flow, depth and camera-pose estimation, and feature matching to separate rigid from dynamic regions, then applies reinforcement learning to fine-tune any video model explicitly for this objective. Experiments show a substantial reduction in geometric artifacts while maintaining visual quality. Code and weights are released.
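
To make the rigid-flow idea concrete, the following is a minimal NumPy sketch, not the released GeoFlow implementation. It assumes a pinhole camera with known intrinsics `K`, a per-pixel depth map, and a relative camera pose `(R, t)` between two frames, and it computes the flow a static scene would exhibit under that camera motion. Pixels whose observed optical flow deviates from this prediction by more than a threshold are treated as dynamic. The function names, the 1-pixel threshold, and the toy score are all illustrative assumptions; the abstract does not specify the actual reward.

```python
import numpy as np


def rigid_flow(depth, K, R, t):
    """Flow (H, W, 2) that a static scene would show under camera motion (R, t).

    depth: (H, W) depth of each pixel in the first frame.
    K:     (3, 3) pinhole intrinsics (assumed shared by both frames).
    R, t:  rotation (3, 3) and translation (3,) taking points from the
           first camera frame into the second camera frame.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    # Homogeneous pixel coordinates, shape (3, H*W).
    pix = np.stack([u.ravel(), v.ravel(), np.ones(h * w)])
    # Back-project every pixel to a 3D point in the first camera frame.
    pts = (np.linalg.inv(K) @ pix) * depth.ravel()
    # Move the points into the second camera frame and project them.
    proj = K @ (R @ pts + np.asarray(t, dtype=np.float64).reshape(3, 1))
    uv2 = proj[:2] / proj[2:3]
    return (uv2 - pix[:2]).T.reshape(h, w, 2)


def split_rigid_dynamic(observed_flow, depth, K, R, t, thresh=1.0):
    """Return (residual, static_mask) by comparing observed and rigid flow.

    thresh is a hypothetical tolerance in pixels, chosen for illustration.
    """
    residual = np.linalg.norm(
        observed_flow - rigid_flow(depth, K, R, t), axis=-1)
    return residual, residual < thresh


def toy_rigid_score(observed_flow, depth, K, R, t, thresh=1.0):
    """Illustrative score only: negative mean residual over static pixels."""
    residual, static = split_rigid_dynamic(
        observed_flow, depth, K, R, t, thresh)
    if not static.any():
        return 0.0
    return -float(residual[static].mean())
```

As a sanity check on the geometry: with focal length 100 px, constant depth 2, and a sideways camera translation of 0.1, every pixel should shift by 100 × 0.1 / 2 = 5 px horizontally. A region whose observed flow departs from that prediction would be flagged as dynamic and could then be checked separately for appearance consistency, for example by feature matching.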