Teaching two AIs to work together: planning then painting video

Bernini Team, Chenchen Liu, Junyi Chen, Lei Li, Lu Chi, Mingzhen Sun, Zhuoying Li, Yi Fu, Ruoyu Guo, Yiheng Wu, Ge Bai, Zehuan Yuan

Bernini combines a language model, which plans what a video should contain, with a diffusion model, which generates the pixels, so that each does what it does best. The language model predicts semantic layouts in image-embedding space; the diffusion model then renders realistic video from that plan together with text and visual details. Training the two components separately preserves each one's original strengths and keeps the approach computationally efficient. The authors report leading results on standard video benchmarks, especially for complex edits where semantic reasoning matters.
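
The plan-then-render split described above can be illustrated with a toy sketch. This is not Bernini's implementation: `plan_semantics` and `render_video` are hypothetical stand-ins, the "embeddings" are short lists of floats, and the "denoising" loop is a simple interpolation toward the plan rather than a learned diffusion model. It also conditions on the plan alone, leaving out the text and visual details the real renderer uses. It only shows the interface between the two stages: the planner emits one embedding per frame, and the renderer starts from noise and refines it under that plan.

```python
import random
from typing import List

Vector = List[float]


def plan_semantics(prompt: str, num_frames: int, dim: int = 8) -> List[Vector]:
    """Hypothetical planner: stands in for the language model.

    Returns one embedding per frame. Here the embeddings are just a
    linear blend between two prompt-seeded random vectors.
    """
    rng = random.Random(prompt)  # same prompt -> same plan
    start = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    end = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    plan = []
    for t in range(num_frames):
        w = t / max(num_frames - 1, 1)
        plan.append([(1.0 - w) * s + w * e for s, e in zip(start, end)])
    return plan


def render_video(plan: List[Vector], steps: int = 20, seed: int = 0) -> List[Vector]:
    """Hypothetical renderer: stands in for the diffusion model.

    Starts from Gaussian noise and moves each frame a fixed fraction
    toward its planned embedding at every step (a toy substitute for
    learned denoising).
    """
    rng = random.Random(seed)
    video = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in plan]
    for _ in range(steps):
        video = [
            [x + 0.3 * (p - x) for x, p in zip(frame, target)]
            for frame, target in zip(video, plan)
        ]
    return video


# Stage 1: plan in embedding space. Stage 2: render conditioned on the plan.
plan = plan_semantics("a red kite rising over a beach", num_frames=4)
video = render_video(plan)
```

Because the two functions communicate only through the list of per-frame embeddings, either one could be swapped or trained on its own, which mirrors the separate-training idea in the summary.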