Generating long videos without retraining the model

Video diffusion models struggle to generate videos longer than the clips they were trained on. This work proposes FlowLong, an inference-time method that stitches together sliding windows of video by matching the model's predictions in overlapping regions (Tweedie matching) and strategically re-injecting noise so that samples stay on the learned manifold. The approach works with any existing model, requires no retraining, and produces temporally coherent videos several times longer than the model's native window length, outperforming autoregressive baselines, which accumulate drift errors.
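
To make the sliding-window idea concrete, here is a minimal NumPy sketch of one way such a pipeline could be organized. It is not FlowLong's actual algorithm: the function names (window_starts, tweedie_x0, fuse_windows, renoise), the plain averaging used to reconcile overlapping predictions, and the noise parameterization x_t = alpha * x0 + sigma * eps are all assumptions made for illustration. Only the Tweedie-style denoised estimate, x0_hat = (x_t - sigma * eps_hat) / alpha, is a standard identity for that parameterization.

```python
import numpy as np


def window_starts(total_frames, window, stride):
    """Start indices of sliding windows that together cover every frame."""
    if not 1 <= stride <= window <= total_frames:
        raise ValueError("need 1 <= stride <= window <= total_frames")
    starts = list(range(0, total_frames - window + 1, stride))
    if starts[-1] + window < total_frames:
        # Add a final window flush with the end so no frame is left out.
        starts.append(total_frames - window)
    return starts


def tweedie_x0(x_t, eps_hat, alpha, sigma):
    """Denoised estimate of x0, assuming x_t = alpha * x0 + sigma * eps."""
    return (x_t - sigma * eps_hat) / alpha


def fuse_windows(x0_windows, starts, total_frames):
    """Average per-window x0 estimates frame by frame.

    Plain averaging is an illustrative stand-in for reconciling
    overlapping predictions, not the paper's matching rule.
    """
    window = x0_windows[0].shape[0]
    acc = np.zeros((total_frames,) + x0_windows[0].shape[1:])
    count = np.zeros(total_frames)
    for x0, s in zip(x0_windows, starts):
        acc[s:s + window] += x0
        count[s:s + window] += 1
    # Reshape counts so they broadcast over the non-time axes.
    return acc / count.reshape(-1, *([1] * (acc.ndim - 1)))


def renoise(x0, alpha, sigma, rng):
    """Re-inject Gaussian noise to bring the fused estimate back to a noisy level."""
    return alpha * x0 + sigma * rng.standard_normal(x0.shape)


# Usage: a 10-frame "video" with one feature per frame and a 4-frame window.
starts = window_starts(total_frames=10, window=4, stride=3)
video = np.arange(10, dtype=float).reshape(10, 1)
fused = fuse_windows([video[s:s + 4] for s in starts], starts, total_frames=10)
noisy = renoise(fused, alpha=0.8, sigma=0.6, rng=np.random.default_rng(0))
```

In a real sampler, each window would be denoised by the pretrained model at every step, and the fused estimate would be re-noised to the next noise level before the windows are processed again. Here the windows are simply slices of a known array, so fusing them recovers the original frames exactly.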