cs.AI

Teaching two AIs to work together: planning then painting video

Bernini Team, Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan

May 21, 2026

Bernini pairs a language model, which plans what a video should contain, with a diffusion model, which generates the pixels, so that each does what it does best. The language model predicts semantic layouts in image-embedding space; the diffusion model then renders realistic video from that plan together with text and visual details. The two components are trained separately, which keeps each one's original strengths intact while remaining computationally efficient. The authors report leading results on standard video benchmarks, particularly on complex edits where semantic reasoning matters.
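To make the plan-then-render split concrete, here is a toy sketch in Python. It is not the Bernini implementation: the function names `plan_semantics` and `render_video`, the embedding size, and the "move halfway toward the plan" update are all assumptions made for illustration, and the renderer ignores the text and visual conditioning described above. It only shows the data flow, where a planner emits one embedding per frame and a renderer refines noise toward that plan.

```python
import hashlib
import random
from typing import List

Vector = List[float]


def plan_semantics(prompt: str, num_frames: int, dim: int = 8) -> List[Vector]:
    """Stage 1: hypothetical stand-in for the language-model planner.

    Returns one embedding vector per frame. The values are a deterministic
    pseudo-random function of the prompt; a real planner would predict them
    in an image-embedding space.
    """
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]


def render_video(plan: List[Vector], steps: int = 10, seed: int = 0) -> List[Vector]:
    """Stage 2: hypothetical stand-in for the diffusion renderer.

    Starts every frame from Gaussian noise and, at each step, moves it
    halfway toward the planned embedding. This mimics iterative refinement
    conditioned on the plan; it is not a real diffusion sampler.
    """
    rng = random.Random(seed)
    frames = [[rng.gauss(0.0, 1.0) for _ in vec] for vec in plan]
    for _ in range(steps):
        frames = [
            [x + 0.5 * (target - x) for x, target in zip(frame, vec)]
            for frame, vec in zip(frames, plan)
        ]
    return frames


def generate(prompt: str, num_frames: int = 4) -> List[Vector]:
    """Run both stages in sequence: plan first, then render from the plan."""
    plan = plan_semantics(prompt, num_frames)
    return render_video(plan)
```

Because the two stages only communicate through the list of per-frame embeddings, either one could be swapped or retrained without touching the other, which is the property the summary attributes to separate training.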
Published as "Bernini: Latent Semantic Planning for Video Diffusion", arXiv:2605.22344.