How 3D geometry fixes video generation's biggest weakness

Jiayi Wu, Haoming Cai, Cornelia Fermuller, Christopher Metzler, Yiannis Aloimonos

Video diffusion models generate impressive footage but struggle with precise camera control and complex occlusions: they fall apart when they must infer hidden geometry. Real2SAM2Real injects a 3D geometric scaffold extracted from SAM3D into the diffusion process, giving the model a physical anchor for what should move where. By decoupling geometry from appearance while staying faithful to pre-trained priors, the approach enables stable video synthesis under dramatic camera shifts and severe occlusions without sacrificing spatiotemporal coherence.
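
To make the idea of a "physical anchor" concrete, here is a minimal sketch of how an explicit 3D scaffold can tell a generator what is visible from a new viewpoint. It is not the Real2SAM2Real pipeline, whose details are not described here. It assumes the scaffold is a plain point cloud, uses a standard pinhole camera with a z-buffer, and the function name `render_depth` and all parameters are hypothetical, chosen for illustration only.

```python
import numpy as np

def render_depth(points, K, R, t, height, width):
    """Project 3D points (N, 3) into a pinhole camera; keep the nearest depth per pixel.

    Illustrative only: a point cloud stands in for the 3D scaffold.
    """
    cam = points @ R.T + t                 # world -> camera coordinates
    z = cam[:, 2]
    front = z > 1e-6                       # discard points behind the camera
    cam, z = cam[front], z[front]
    uv = cam @ K.T                         # homogeneous pixel coordinates
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)        # z-buffer: the nearest point wins
    return depth

# Two points on the same viewing ray: the far one is occluded by the near one.
points = np.array([[0.0, 0.0, 2.0],
                   [0.0, 0.0, 5.0]])
K = np.array([[10.0, 0.0, 4.0],
              [0.0, 10.0, 4.0],
              [0.0, 0.0, 1.0]])            # assumed intrinsics for an 8x8 image
R = np.eye(3)

# Original view: both points land on one pixel, only the near depth survives.
depth_front = render_depth(points, K, R, np.zeros(3), 8, 8)

# Shifted camera: parallax separates the points, so the hidden one appears.
depth_shift = render_depth(points, K, R, np.array([0.4, 0.0, 0.0]), 8, 8)
```

In the first view only one pixel is filled, because the far point is hidden behind the near one. After the camera shift the two points fall on different pixels. A depth or visibility map like this is one plausible form of geometric conditioning signal: it encodes where content should move and what becomes disoccluded, independently of appearance.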