Steering video diffusion to follow complex instructions

Text-to-video diffusion models struggle with compositional prompts that involve multiple entities, their attributes, spatial relations, and movements. CVG addresses this by training a lightweight classifier on cross-attention features from the frozen generator, then using the classifier's gradients during the early denoising steps to steer the latent trajectory toward the desired composition. The approach transfers to semantically related labels and improves prompt faithfulness on compositional benchmarks while preserving visual quality. It requires no architecture changes, no generator fine-tuning, and no user-supplied layouts or bounding boxes.
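
The steering step described above follows the general pattern of classifier guidance: after each early denoising update, the latent is nudged along the gradient of the classifier's log-probability for the target label. The toy sketch below illustrates only that pattern and is not the CVG implementation. The denoiser is a stand-in that shrinks the latent, the classifier is a logistic model applied directly to the latent rather than to cross-attention features, and every name and parameter (`toy_denoise_step`, `guided_steps`, `scale`, and so on) is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classifier_prob_and_grad(latent, weights, bias):
    """Toy logistic classifier (assumption: stands in for the lightweight
    classifier on cross-attention features).

    Returns p(target | latent) and the gradient of log p w.r.t. the latent.
    For p = sigmoid(w . z + b), d/dz log p = (1 - p) * w.
    """
    logit = sum(w * z for w, z in zip(weights, latent)) + bias
    p = sigmoid(logit)
    grad = [(1.0 - p) * w for w in weights]
    return p, grad


def toy_denoise_step(latent, shrink=0.9):
    """Hypothetical stand-in for one update of the frozen generator."""
    return [shrink * z for z in latent]


def sample(latent, weights, bias, num_steps=20, guided_steps=5, scale=0.5):
    """Run the toy denoising loop, applying guidance only in early steps."""
    z = list(latent)
    for step in range(num_steps):
        z = toy_denoise_step(z)
        if step < guided_steps:
            # Early steps only: move the latent toward higher classifier
            # log-probability for the target composition.
            _, grad = classifier_prob_and_grad(z, weights, bias)
            z = [zi + scale * gi for zi, gi in zip(z, grad)]
    return z


if __name__ == "__main__":
    start = [0.5, -1.0, 0.2]
    weights, bias = [1.0, -0.5, 2.0], 0.0

    unguided = sample(start, weights, bias, guided_steps=0)
    guided = sample(start, weights, bias, guided_steps=5)

    p_unguided, _ = classifier_prob_and_grad(unguided, weights, bias)
    p_guided, _ = classifier_prob_and_grad(guided, weights, bias)
    print(f"classifier probability without guidance: {p_unguided:.3f}")
    print(f"classifier probability with guidance:    {p_guided:.3f}")
```

In a real pipeline the gradient would presumably be obtained by automatic differentiation through the classifier and the generator's cross-attention layers, with the generator's weights left untouched. Restricting guidance to the first few steps mirrors the text's emphasis on early denoising.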