cs.CV

Natural language commands for video game world models

Shangwen Zhu, Qianyu Peng, Zhao Pu, Zhilei Shu, Xiangrui Ke, Zhaohu Xing, Zizhao Tong, Zeqing Wang, Xinyu Cui, Huangji Wang, Jian Zhao, Yeying Jin, Fan Cheng, Ruili Feng

May 18, 2026

Existing interactive video world models lack fine-grained control over multiple entities and struggle to generalize across game engines and characters. Incantation replaces standard action interfaces (animation IDs, device inputs) with natural language conditioning, applied per frame at 0.25-second granularity through a pretrained video backbone with text cross-attention. The approach enables real-time streaming (19.7 FPS at 480p) via ODE-initialized Self-Forcing distillation and supports concept transfer between entities, demonstrated on Elden Ring and The King of Fighters. On cross-entity transfer tasks, Incantation reaches 89% accuracy versus 43% for action-index baselines; on out-of-vocabulary prompts, it reaches 90% versus 0%. Code, model weights, and a preview dataset of structured combat clips are released.
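To make the idea of language as a time-indexed action interface concrete, here is a minimal Python sketch of how timestamped per-entity commands could be turned into one text prompt per 0.25-second window. It is an illustration under stated assumptions, not the paper's implementation: the `Command` type, the `prompts_per_window` helper, the `"entity: text"` prompt format, and the `"idle"` fallback are all hypothetical, and the summary does not describe how the released code actually builds its prompts.

```python
from dataclasses import dataclass

WINDOW_SECONDS = 0.25  # conditioning granularity stated in the summary


@dataclass
class Command:
    """Hypothetical container for one natural language command."""
    entity: str   # assumed entity label, e.g. "player" or "boss"
    text: str     # free-form command text
    start: float  # start time in seconds (inclusive)
    end: float    # end time in seconds (exclusive)


def prompts_per_window(commands, duration):
    """Return one prompt string per 0.25 s window covering [0, duration)."""
    n_windows = int(round(duration / WINDOW_SECONDS))
    prompts = []
    for i in range(n_windows):
        t0 = i * WINDOW_SECONDS
        t1 = t0 + WINDOW_SECONDS
        # A command is active in a window if their time ranges overlap.
        active = [f"{c.entity}: {c.text}"
                  for c in commands if c.start < t1 and c.end > t0]
        # "idle" is an assumed placeholder for windows with no command.
        prompts.append("; ".join(active) if active else "idle")
    return prompts


commands = [
    Command("player", "rolls left", 0.0, 0.5),
    Command("boss", "swings its sword", 0.25, 1.0),
]
prompts = prompts_per_window(commands, duration=1.0)
for i, p in enumerate(prompts):
    print(f"{i * WINDOW_SECONDS:.2f}s  {p}")
```

In a real system, each window's prompt would be encoded by a text encoder and passed to the backbone's cross-attention layers; that step is omitted here because the summary does not specify it. Because the interface is plain text, a command written for one entity can be reissued for another by changing only the entity label, which is the kind of cross-entity transfer the abstract reports.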
Published as "Incantation: Natural Language as the Action Interface for Multi-Entity Video World Models", arXiv:2605.18601.