Natural language commands for video game world models
Shangwen Zhu, Qianyu Peng, Zhao Pu, Zhilei Shu, Xiangrui Ke, Zhaohu Xing, Zizhao Tong, Zeqing Wang, Xinyu Cui, Huangji Wang, Jian Zhao, Yeying Jin, Fan Cheng, Ruili Feng
May 18, 2026
Existing interactive video world models lack fine-grained control over multiple entities and struggle to generalize across game engines and characters. Incantation replaces standard action interfaces (animation IDs, device inputs) with natural language conditioning, applied at a 0.25-second granularity through a pretrained video backbone with text cross-attention. ODE-initialized Self-Forcing distillation enables real-time streaming (19.7 FPS at 480p), and the language interface supports concept transfer between entities, demonstrated on Elden Ring and The King of Fighters. On cross-entity transfer tasks, Incantation achieves 89% accuracy versus 43% for action-index baselines, and 90% versus 0% on out-of-vocabulary prompts. Code, model weights, and a preview dataset of structured combat clips are released.
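As a rough illustration of what a language interface at 0.25-second intervals could look like from the caller's side, the sketch below turns a list of timed text commands into one prompt per 0.25-second window. The function name `prompts_per_window`, the `(start_time, text)` command format, and the rule that a window reuses the most recent command are all assumptions made for illustration; none of them come from the paper or its released code.

```python
import math
from bisect import bisect_right

WINDOW_SECONDS = 0.25  # conditioning interval stated in the summary


def prompts_per_window(commands, duration_seconds, default=""):
    """Hypothetical helper: build one text prompt per 0.25 s window.

    commands: list of (start_time_seconds, text), sorted by start time.
    Each window uses the latest command issued at or before the window's
    start (an assumed "hold last command" rule); windows that precede the
    first command receive `default`.
    """
    starts = [start for start, _ in commands]
    # Round up so a trailing partial window still gets a prompt.
    n_windows = math.ceil(duration_seconds / WINDOW_SECONDS)
    schedule = []
    for i in range(n_windows):
        window_start = i * WINDOW_SECONDS
        idx = bisect_right(starts, window_start) - 1
        schedule.append(commands[idx][1] if idx >= 0 else default)
    return schedule


if __name__ == "__main__":
    demo = [(0.0, "the knight raises his shield"),
            (0.5, "the knight rolls to the left")]
    for i, prompt in enumerate(prompts_per_window(demo, 1.0)):
        print(f"{i * WINDOW_SECONDS:.2f}s: {prompt}")
```

In a real system each window's prompt would be encoded and passed to the model's text cross-attention; this sketch covers only the scheduling step, which is the part an action-index interface would express with IDs instead of text.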