Teaching two AIs to work together: planning then painting video
Bernini Team, Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan
May 21, 2026
Bernini pairs a language model, which plans what a video should contain, with a diffusion model, which generates the pixels, so that each does what it does best. The language model predicts a semantic layout in image-embedding space; the diffusion model then renders realistic video from that plan together with the text and visual details. Training the two components separately preserves each one's original strengths and keeps the computational cost down. The authors report leading results on standard video benchmarks, particularly on complex edits where semantic reasoning matters.
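To make the plan-then-render split concrete, here is a minimal toy sketch of a two-stage pipeline. It is not Bernini's implementation: every name (`plan_semantics`, `render_video`, `generate`, `EMBED_DIM`) is hypothetical, the "language model" is replaced by a seeded random-vector generator, and the "diffusion model" by a simple loop that pulls noise toward a value derived from the plan. It only illustrates the data flow, in which the renderer consumes per-frame embeddings produced by the planner.

```python
import random
from dataclasses import dataclass

EMBED_DIM = 8  # hypothetical embedding size, chosen only for illustration


@dataclass
class Plan:
    """Semantic plan: one embedding vector per video frame."""
    frame_embeddings: list


def plan_semantics(prompt: str, num_frames: int) -> Plan:
    """Stand-in for the language model (assumption: a real planner would
    predict embeddings; here we draw seeded random vectors)."""
    rng = random.Random(sum(map(ord, prompt)))  # deterministic per prompt
    vectors = [[rng.gauss(0, 1) for _ in range(EMBED_DIM)]
               for _ in range(num_frames)]
    return Plan(vectors)


def render_video(plan: Plan, prompt: str, height: int = 4, width: int = 4,
                 steps: int = 10, seed: int = 0) -> list:
    """Stand-in for the diffusion model: start each frame from noise and
    iteratively move it toward a target derived from the plan.
    The prompt is accepted only to mirror the interface; this toy ignores it."""
    rng = random.Random(seed)
    video = []
    for emb in plan.frame_embeddings:
        target = sum(emb) / len(emb)  # toy conditioning signal
        frame = [[rng.gauss(0, 1) for _ in range(width)]
                 for _ in range(height)]
        for _ in range(steps):
            # Each step halves the distance between every pixel and the target.
            frame = [[p + 0.5 * (target - p) for p in row] for row in frame]
        video.append(frame)
    return video


def generate(prompt: str, num_frames: int = 3):
    """Run both stages in order: plan first, then render from the plan."""
    plan = plan_semantics(prompt, num_frames)
    return plan, render_video(plan, prompt)


if __name__ == "__main__":
    plan, video = generate("a cat jumps onto a table")
    print(len(plan.frame_embeddings), "planned frames,",
          len(video), "rendered frames")
```

Because the two stages communicate only through the `Plan` object, either stand-in could be swapped out or retrained without touching the other, which is the property the summary attributes to training the components separately.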