Why AI video gets stuck at the first frame—and how to fix it

Video diffusion models that generate frame by frame anchor to the first frame's representation, which dominates attention and locks the scene in place. The result is dampened motion, camera movement, and scene evolution. Researchers replace this static anchor with an adaptive latent state that is updated at each generation step, treating time as relative rather than absolute. The model attends to both the previous state and the current content to build its own reference dynamically, enabling substantially more natural video progression without external modules.
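
To make the contrast concrete, here is a toy sketch of a static first-frame anchor next to an adaptive state that is refreshed at every step. It is not the researchers' implementation: the names `attend`, `update_state`, and `run`, the tensor shapes, and the choice of plain scaled dot-product attention over the concatenated previous state and current content are all assumptions made for illustration. The update never sees an absolute frame index, which is one loose way to read "relative" time.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(query, keys, values):
    # Standard scaled dot-product attention.
    scores = query @ keys.T / np.sqrt(query.shape[-1])
    return softmax(scores) @ values

def update_state(prev_state, current):
    # Hypothetical update rule (an assumption, not from the source):
    # the previous state queries a context made of itself and the
    # current frame's tokens, so the new state mixes both.
    context = np.concatenate([prev_state, current], axis=0)
    return attend(prev_state, context, context)

def run(frames, adaptive):
    # Return the reference ("anchor") used at each step after the first.
    anchor = frames[0]
    anchors = []
    for frame in frames[1:]:
        if adaptive:
            anchor = update_state(anchor, frame)  # evolves every step
        anchors.append(anchor)                    # static case: always frames[0]
    return anchors

rng = np.random.default_rng(0)
# Stand-in latents: 5 frames, 4 tokens per frame, 8 channels per token.
frames = [rng.normal(size=(4, 8)) for _ in range(5)]

static = run(frames, adaptive=False)
adaptive = run(frames, adaptive=True)

# How far the final reference has moved away from the first frame.
print("static drift:  ", np.linalg.norm(static[-1] - frames[0]))
print("adaptive drift:", np.linalg.norm(adaptive[-1] - frames[0]))
```

In the static run the reference never leaves the first frame, so every later frame is pulled back toward it. In the adaptive run the reference absorbs each new frame's content, which is the behavior the paragraph above describes, though a real model would learn this update rather than use a fixed attention step.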