cs.CV

How 3D geometry fixes video generation's biggest weakness

Jiayi Wu, Haoming Cai, Cornelia Fermuller, Christopher Metzler, Yiannis Aloimonos

May 29, 2026

Video diffusion models generate impressive footage but struggle with precise camera control and complex occlusions, and they break down when they must infer hidden geometry. Real2SAM2Real injects a 3D geometric scaffold extracted from SAM3D into the diffusion process, giving the model a physical anchor for what should move where. The approach decouples geometry from appearance and stays faithful to pre-trained priors, enabling stable video synthesis under dramatic camera shifts and severe occlusions while maintaining spatiotemporal coherence.
Published as "Real2SAM2Real: Generative 3D Caches as Complementary Context for Video Diffusion" (arXiv:2606.00299).