How 3D geometry fixes video generation's biggest weakness
Jiayi Wu, Haoming Cai, Cornelia Fermuller, Christopher Metzler, Yiannis Aloimonos
May 29, 2026
Video diffusion models generate impressive footage but struggle with precise camera control and complex occlusions: they break down when they have to infer hidden geometry. Real2SAM2Real injects a 3D geometric scaffold, extracted with SAM3D, into the diffusion process, giving the model a physical anchor for what should move where. The approach decouples geometry from appearance and stays faithful to the model's pre-trained priors, enabling stable video synthesis under dramatic camera shifts and severe occlusions while maintaining spatiotemporal coherence.
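The summary does not say how the SAM3D scaffold is turned into a signal the diffusion model can use. One common way to give a video model a per-frame geometric anchor is to render the 3D geometry into each target camera with a z-buffer, so that camera motion and occlusion are decided by explicit geometry rather than inferred. The minimal sketch below assumes a point-cloud scaffold, a pinhole camera, and per-frame depth maps as the conditioning signal; the function names `render_scaffold_depth` and `scaffold_conditioning` are hypothetical, and this is not Real2SAM2Real's actual pipeline.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Tuple[Vec3, Vec3, Vec3]
DepthMap = List[List[Optional[float]]]


def render_scaffold_depth(points: Sequence[Vec3], fx: float, fy: float,
                          cx: float, cy: float, R: Mat3, t: Vec3,
                          width: int, height: int) -> DepthMap:
    """Project a 3D point scaffold into one pinhole camera and z-buffer it.

    Hypothetical helper: returns a height x width grid holding the nearest
    depth per pixel, or None where no scaffold point lands.
    """
    depth: DepthMap = [[None] * width for _ in range(height)]
    for X in points:
        # World -> camera coordinates: X_c = R @ X + t
        xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
        z = xc[2]
        if z <= 0:
            continue  # point is behind the camera
        u = round(fx * xc[0] / z + cx)
        v = round(fy * xc[1] / z + cy)
        if 0 <= u < width and 0 <= v < height:
            # Z-buffer test: keep only the nearest surface, so farther
            # points on the same pixel are treated as occluded.
            if depth[v][u] is None or z < depth[v][u]:
                depth[v][u] = z
    return depth


def scaffold_conditioning(points: Sequence[Vec3],
                          intrinsics: Tuple[float, float, float, float],
                          poses: Sequence[Tuple[Mat3, Vec3]],
                          width: int, height: int) -> List[DepthMap]:
    """Render one depth map per frame of a camera trajectory (assumed signal)."""
    fx, fy, cx, cy = intrinsics
    return [render_scaffold_depth(points, fx, fy, cx, cy, R, t, width, height)
            for R, t in poses]


if __name__ == "__main__":
    I: Mat3 = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
    scaffold = [(0.0, 0.0, 1.0), (0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
    trajectory = [(I, (0.0, 0.0, 0.0)), (I, (-1.0, 0.0, 0.0))]
    for frame in scaffold_conditioning(scaffold, (2.0, 2.0, 2.0, 2.0),
                                       trajectory, 5, 5):
        print(frame[2])
```

In this toy trajectory, the point at depth 2 directly behind the point at depth 1 is hidden in the first frame and becomes visible once the camera translates sideways. That disocclusion falls out of the geometry itself, which is the kind of "what should move where" anchor the summary describes; how the actual method encodes and injects such a signal into the denoiser is not specified here.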