cs.CV

Steering video diffusion to follow complex instructions

Ariel Shaulov, Eitan Shaar, Amit Edenzon, Gal Chechik, Lior Wolf

May 14, 2026

Text-to-video diffusion models struggle with compositional prompts that involve multiple entities together with their attributes, spatial relations, and movements. CVG (Compositional Video Generation) addresses this by training a lightweight classifier on cross-attention features from the frozen generator, then using the classifier's gradients during early denoising steps to steer the latent trajectory toward the desired composition. The approach transfers to semantically related labels and improves prompt faithfulness on compositional benchmarks while preserving visual quality. It requires no architecture changes, no generator fine-tuning, and no user-supplied layouts or boxes.
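The guidance idea described above can be illustrated with a toy sketch. This is not the paper's implementation: the latent is a short list of floats, the "classifier" is a logistic regression applied directly to the latent (standing in for a classifier on cross-attention features), and `toy_denoise_step` is a placeholder shrinkage rather than a real diffusion update. All function names and parameters (`guided_steps`, `scale`, and so on) are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def classifier_prob(latent, weights, bias):
    """Toy stand-in for the lightweight classifier.

    Assumption: a logistic regression on the latent itself, in place of
    a classifier over cross-attention features of a frozen generator.
    """
    logit = sum(w * z for w, z in zip(weights, latent)) + bias
    return sigmoid(logit)


def classifier_grad(latent, weights, bias):
    """Gradient of log p(composition satisfied | latent) w.r.t. the latent.

    For a logistic model, d/dz log sigmoid(w.z + b) = (1 - p) * w.
    """
    p = classifier_prob(latent, weights, bias)
    return [(1.0 - p) * w for w in weights]


def toy_denoise_step(latent, shrink=0.9):
    """Placeholder for one step of the frozen generator (not a real sampler)."""
    return [shrink * z for z in latent]


def guided_sampling(latent, weights, bias, num_steps=20, guided_steps=5, scale=0.5):
    """Run the toy sampler, nudging the latent only in the early steps."""
    latent = list(latent)
    for step in range(num_steps):
        if step < guided_steps:
            # Early steps: move the latent along the classifier gradient.
            grad = classifier_grad(latent, weights, bias)
            latent = [z + scale * g for z, g in zip(latent, grad)]
        # The generator itself is never modified; it only sees the nudged latent.
        latent = toy_denoise_step(latent)
    return latent


if __name__ == "__main__":
    start = [0.5, -1.0, 0.25]
    weights, bias = [1.0, -2.0, 0.5], 0.0
    unguided = guided_sampling(start, weights, bias, guided_steps=0)
    guided = guided_sampling(start, weights, bias, guided_steps=5)
    print("unguided score:", classifier_prob(unguided, weights, bias))
    print("guided score:  ", classifier_prob(guided, weights, bias))
```

Restricting the nudge to the first few steps mirrors the summary's point that guidance is applied during early denoising, and the generator stand-in is left untouched throughout.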
Published as "Compositional Video Generation via Inference-Time Guidance", arXiv:2605.14988