How to make diffusion models plan farther ahead without exploding compute costs

Diffusion models excel at generating short sequences, but extending them to long-horizon tasks breaks coherence: neighboring plans stay locally consistent yet combine into implausible global trajectories. CoFi splits generation into two stages. It first builds a coarse structural scaffold that captures the task-level arrangement, then refines the details while preserving that scaffold. Across robotic manipulation, panoramic images, and long videos, CoFi improves both global structure and sample quality while cutting denoiser calls by 2–8×.
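
The two-stage idea can be illustrated with a toy sketch. This is not CoFi's actual implementation: `toy_denoise`, `coarse_to_fine`, and every parameter below are hypothetical names chosen for illustration, and a simple neighbor-averaging step stands in for a learned denoiser. The sketch only shows the control flow: a few waypoints are settled first, then a full-resolution sequence is refined with those waypoints clamped so the scaffold survives refinement.

```python
import numpy as np


def toy_denoise(x, fixed_mask, strength=0.5):
    """Hypothetical stand-in for a learned denoiser (an assumption, not CoFi's).

    Pulls every free point toward the mean of its two neighbors and leaves
    clamped points untouched. The first and last points must be clamped,
    because np.roll wraps around at the ends.
    """
    neighbors = 0.5 * (np.roll(x, 1) + np.roll(x, -1))
    out = x + strength * (neighbors - x)
    out[fixed_mask] = x[fixed_mask]
    return out


def coarse_to_fine(start, goal, n_waypoints=5, stride=8,
                   coarse_steps=20, fine_steps=10, seed=0):
    """Toy two-stage generation of a 1-D trajectory from start to goal."""
    rng = np.random.default_rng(seed)
    calls = 0  # counts invocations of the stand-in denoiser

    # Stage 1: settle a short, coarse scaffold of waypoints.
    coarse = rng.normal(size=n_waypoints)
    coarse[0], coarse[-1] = start, goal
    ends = np.zeros(n_waypoints, dtype=bool)
    ends[[0, -1]] = True
    for _ in range(coarse_steps):
        coarse = toy_denoise(coarse, ends)
        calls += 1

    # Stage 2: refine at full resolution with the scaffold clamped.
    horizon = (n_waypoints - 1) * stride + 1
    fine = rng.normal(size=horizon)
    anchors = np.zeros(horizon, dtype=bool)
    anchors[::stride] = True          # every stride-th point is a waypoint
    fine[anchors] = coarse
    for _ in range(fine_steps):
        fine = toy_denoise(fine, anchors)
        calls += 1

    return coarse, fine, calls


if __name__ == "__main__":
    coarse, fine, calls = coarse_to_fine(start=0.0, goal=1.0)
    print("waypoints:", np.round(coarse, 3))
    print("trajectory length:", len(fine), "| denoiser calls:", calls)
```

In this toy setup the expensive full-resolution loop only has to fill in detail between fixed waypoints, which is one plausible reading of why a scaffold-first approach can need fewer denoiser calls. The sketch does not attempt to reproduce the reported 2–8× figure.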