Remembering virtual worlds without breaking real-time speed

Jung Yi, Minjae Kim, Paul Hyunbin Cho, Wooseok Jang, Sangdoo Yun, Seungryong Kim

Autoregressive video diffusion models can generate interactive worlds in real time, but they face a hard tradeoff: keeping a perfect memory of past scenes kills the frame rate, while keeping inference fast means forgetting the world. WorldKV addresses this by storing discarded memory chunks in GPU or CPU memory and selectively retrieving them based on camera position and action, while pruning redundant tokens within each chunk. On two benchmarks, it matches full-memory consistency at double the speed, with no fine-tuning required.
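
To make the store, retrieve, and prune idea concrete, here is a minimal toy sketch in plain Python. It is not WorldKV's implementation: the names `Chunk`, `ChunkStore`, and `prune_tokens`, the use of Euclidean distance between camera positions, the treatment of an action as a position offset, and the cosine-similarity threshold for redundancy are all assumptions made for illustration, and short float lists stand in for cached attention tokens.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Chunk:
    pose: Vec3                 # hypothetical camera position when the chunk was generated
    tokens: List[List[float]]  # stand-in for the chunk's cached tokens


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def prune_tokens(tokens: List[List[float]], threshold: float = 0.99) -> List[List[float]]:
    """Greedy pruning (assumed rule): drop a token if it is a near-duplicate of one already kept."""
    kept: List[List[float]] = []
    for t in tokens:
        if all(cosine(t, k) < threshold for k in kept):
            kept.append(t)
    return kept


class ChunkStore:
    """Holds chunks evicted from the active window so they can be fetched again later."""

    def __init__(self) -> None:
        self.chunks: List[Chunk] = []

    def evict(self, chunk: Chunk) -> None:
        # Prune redundant tokens before archiving the discarded chunk.
        self.chunks.append(Chunk(chunk.pose, prune_tokens(chunk.tokens)))

    def retrieve(self, pose: Vec3, action: Vec3, k: int = 2) -> List[Chunk]:
        # Assumption: the action is a position offset, so the camera is expected
        # to be at pose + action; return the k stored chunks nearest to that point.
        target = tuple(p + a for p, a in zip(pose, action))
        return sorted(self.chunks, key=lambda c: math.dist(c.pose, target))[:k]


store = ChunkStore()
store.evict(Chunk((0.0, 0.0, 0.0), [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
store.evict(Chunk((10.0, 0.0, 0.0), [[1.0, 1.0]]))
store.evict(Chunk((1.0, 0.0, 0.0), [[0.0, 1.0]]))
nearby = store.retrieve(pose=(0.0, 0.0, 0.0), action=(1.0, 0.0, 0.0), k=2)
```

In this toy run, the duplicate token in the first chunk is dropped at eviction time, and moving toward (1, 0, 0) pulls back the two chunks recorded near that position while leaving the distant one in storage. A real system would operate on key/value tensors and would likely use a richer relevance score than raw distance.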