How to extend short video clips into minute-long cinematic sequences?

Akide Liu, Jinbo Xing, Chaojie Mao, Ye Li, Zeyu Zhang, Yefei He, Weijie Wang, Zihan Wang, Yu Liu, Gholamreza Haffari, Bohan Zhuang

Generating long, coherent videos remains an open problem: current models either stay within a single shot or extend the content freely without grounding it in the original footage. ReCA treats long-video generation as a hierarchical decomposition problem. It recursively splits the task into smaller chunks that each fit within a bounded context, runs frozen short-video generators on each chunk, and propagates visual state (identity, scene, objects) from one chunk to the next to prevent drift. On a new benchmark for 3–5-minute video generation, ReCA outperforms competing methods by 8–16% in overall quality and achieves considerably better shot consistency.
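The abstract does not specify how ReCA splits the task or what its carried state looks like, so the following is only a toy sketch of the general pattern it describes, under loudly stated assumptions. It recursively halves a target duration until every piece fits a hypothetical per-call context limit, then calls a stand-in for a frozen short-video generator on each piece in order, threading a state dictionary through the calls. The names `decompose`, `generate_long_video`, `toy_generator`, the halving rule, and the state keys are all hypothetical illustrations, not the authors' method or API.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical carried visual state (e.g. identity/scene/object descriptors).
State = Dict[str, object]
# Hypothetical frozen generator: (start_s, end_s, state) -> (clip, updated_state).
ClipGenerator = Callable[[float, float, State], Tuple[str, State]]


def decompose(start: float, end: float, max_len: float) -> List[Tuple[float, float]]:
    """Recursively halve [start, end) until every interval is at most max_len seconds.

    Halving is an assumed splitting rule chosen for simplicity.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if end - start <= max_len:
        return [(start, end)]
    mid = (start + end) / 2
    return decompose(start, mid, max_len) + decompose(mid, end, max_len)


def generate_long_video(
    total_s: float, max_len_s: float, generate_clip: ClipGenerator, init_state: State
) -> Tuple[List[str], State]:
    """Run the generator chunk by chunk, passing the state from each chunk to the next."""
    state = dict(init_state)
    clips: List[str] = []
    for start, end in decompose(0.0, total_s, max_len_s):
        # Each call sees only a bounded time window plus the propagated state.
        clip, state = generate_clip(start, end, state)
        clips.append(clip)
    return clips, state


def toy_generator(start: float, end: float, state: State) -> Tuple[str, State]:
    """Stand-in for a frozen short-video model: labels the interval, counts chunks."""
    new_state = dict(state)
    new_state["chunks"] = int(state.get("chunks", 0)) + 1
    return f"{state['identity']}@{start:.1f}-{end:.1f}", new_state


if __name__ == "__main__":
    clips, final = generate_long_video(240.0, 10.0, toy_generator, {"identity": "hero"})
    print(len(clips), clips[0], final["chunks"])
```

With a 4-minute target and an assumed 10-second limit per call, halving yields 32 chunks of 7.5 seconds, and the `identity` entry set at the start reaches every chunk unchanged. A real system would carry richer state (for example reference frames or embeddings) and would likely split along shot boundaries rather than by fixed halving.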