Faster video generation with frame-by-frame diffusion distillation

Min Zhao, Hongzhou Zhu, Kaiwen Zheng, Zihan Zhou, Bokai Yan, Xinyuan Li, Xiao Yang, Chongxuan Li, Jun Zhu

Real-time interactive video generation requires low latency and streaming output. This work addresses frame-wise autoregressive generation with very few sampling steps (1–2) and identifies initialization of the student model as the critical bottleneck. Causal Forcing++ uses causal consistency distillation, which learns from a single online teacher step instead of precomputed teacher trajectories, reducing initialization cost and training time by about 4×. On VBench, the 2-step model outperforms prior 4-step approaches (0.3 higher in quality, 0.335 higher in reward) while halving first-frame latency. Code and project materials are available.
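
To make the idea of learning from a single online teacher step concrete, the following is a minimal sketch of a generic consistency-distillation loss on a toy problem. It is not the authors' implementation. The names `teacher_velocity`, `teacher_step`, `consistency_loss`, `exact_student`, and `identity_student`, the linear noising schedule x_t = (1 - t) * x0 + t * noise, and the toy data `MU` are all assumptions made for illustration. The causal, frame-by-frame conditioning on earlier frames is omitted.

```python
# Toy sketch of consistency distillation with ONE online teacher step.
# Everything here is a hypothetical illustration, not the paper's code.

MU = [1.0, -2.0]  # toy "dataset": every clean sample x0 equals MU


def teacher_velocity(x, t):
    """Hypothetical teacher: exact ODE velocity for
    x_t = (1 - t) * x0 + t * noise when x0 is always MU."""
    return [(xi - mi) / t for xi, mi in zip(x, MU)]


def teacher_step(x_t, t, s):
    """One Euler step of the teacher ODE from time t down to s < t.
    It is computed on the fly, so no teacher trajectory is stored."""
    v = teacher_velocity(x_t, t)
    return [xi + (s - t) * vi for xi, vi in zip(x_t, v)]


def consistency_loss(student, target_student, x_t, t, s):
    """Mean squared gap between the student's clean-sample prediction at
    (x_t, t) and the target network's prediction at the teacher-stepped
    point (x_s, s). In real training the target would be a frozen or EMA
    copy of the student (assumption: standard consistency distillation)."""
    x_s = teacher_step(x_t, t, s)
    pred = student(x_t, t)
    target = target_student(x_s, s)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def exact_student(x, t):
    """A student that already maps any noisy point to the clean sample."""
    v = teacher_velocity(x, t)
    return [xi - t * vi for xi, vi in zip(x, v)]


def identity_student(x, t):
    """An untrained-looking student that returns its input unchanged."""
    return list(x)


noise = [0.5, 0.25]  # fixed noise so the example is deterministic
t, s = 0.8, 0.6      # adjacent time steps, with s < t
x_t = [(1 - t) * m + t * e for m, e in zip(MU, noise)]

loss_exact = consistency_loss(exact_student, exact_student, x_t, t, s)
loss_identity = consistency_loss(identity_student, identity_student, x_t, t, s)
print(loss_exact, loss_identity)
```

In this toy setting a student that is already self-consistent along the teacher's ODE path incurs essentially zero loss, while the identity student does not. The only teacher computation per training example is the single call to `teacher_step`, which is the property the abstract contrasts with precomputed trajectories.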