Efficient world model generates minute-long videos with precise camera control

Haoyi Zhu, Haozhe Liu, Yuyang Zhao, Tian Ye, Junsong Chen, Jincheng Yu, Tong He, Song Han, Enze Xie

SANA-WM is an open-source world model that synthesizes minute-scale videos with precise 6-degree-of-freedom (6-DoF) camera control. The architecture combines frame-wise linear attention with softmax attention for memory-efficient long-context modeling, uses a dual-branch design for camera trajectory adherence, and applies a two-stage pipeline to refine sequence quality. Trained in 15 days on 64 H100 GPUs using 213K publicly available videos with extracted pose annotations, it generates 60-second clips on a single GPU. It outperforms prior open-source baselines in action-following accuracy while achieving 36× higher throughput. A quantized variant runs on consumer GPUs.
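
To illustrate why pairing linear attention with softmax attention helps with long contexts, the following is a minimal NumPy sketch of the two generic mechanisms. It is not SANA-WM's implementation: the feature map (`elu(x) + 1`), the single-head, non-causal setup, and all function names are assumptions chosen for illustration, and the frame-wise grouping used by the model is not reproduced here. The point is only that linear attention summarizes all keys and values in a fixed-size matrix, while softmax attention materializes an n × n weight matrix.

```python
import numpy as np

def feature_map(x):
    # Positive feature map elu(x) + 1 (an assumed, commonly used choice).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def softmax_attention(q, k, v):
    # Standard attention: builds an (n, n) weight matrix, so memory
    # grows quadratically with the number of tokens n.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v

def linear_attention(q, k, v):
    # Kernelized attention: keys and values are folded into a (d, d_v)
    # summary whose size does not depend on the number of tokens n.
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv_summary = phi_k.T @ v          # shape (d, d_v)
    key_sum = phi_k.sum(axis=0)       # shape (d,)
    normalizer = phi_q @ key_sum      # shape (n,)
    return (phi_q @ kv_summary) / normalizer[:, None]

# Hypothetical sizes: 6 tokens, key dimension 4, value dimension 3.
rng = np.random.default_rng(0)
q = rng.standard_normal((6, 4))
k = rng.standard_normal((6, 4))
v = rng.standard_normal((6, 3))
print(linear_attention(q, k, v).shape, softmax_attention(q, k, v).shape)
```

In this sketch the cost of `linear_attention` grows linearly with the token count, which is the property that makes minute-long sequences tractable, while `softmax_attention` keeps exact pairwise weighting at quadratic cost.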