A benchmark for keeping characters consistent across long AI-generated videos

Multi-shot video generation struggles to keep the same character, object, or location visually consistent as sequences grow longer. EntityBench provides 140 episodes (2,491 shots) drawn from real narrative media, with explicit per-shot entity tracking. Episodes are split into easy, medium, and hard tiers, running up to 50 shots with recurrence gaps of up to 48 shots. Evaluation separates intra-shot quality, prompt alignment, and cross-shot consistency, and applies a fidelity gate so that only accurate entity appearances count toward consistency scores. The authors also release EntityMem, a memory-augmented generation system that pre-stores verified visual references for each entity; code and data are publicly available on GitHub.
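
To make the fidelity gate concrete, here is a minimal Python sketch of how a gated cross-shot consistency score might be computed for one entity. It is an illustration under stated assumptions, not the EntityBench implementation: the function name `gated_consistency`, the `(shot_index, fidelity_score, feature)` tuple layout, the threshold of 0.5, and the use of mean pairwise similarity are all hypothetical choices made for this example.

```python
from itertools import combinations
from statistics import mean


def gated_consistency(appearances, similarity, fidelity_threshold=0.5):
    """Mean pairwise similarity over appearances that pass a fidelity gate.

    appearances: list of (shot_index, fidelity_score, feature) tuples for
        one entity (hypothetical layout, not the benchmark's data format).
    similarity: callable taking two features and returning a float.
    fidelity_threshold: assumed cutoff; appearances scoring below it are
        treated as inaccurate and excluded from the consistency score.
    Returns None when fewer than two appearances pass the gate.
    """
    kept = [feat for _, fid, feat in appearances if fid >= fidelity_threshold]
    if len(kept) < 2:
        return None  # nothing to compare across shots
    return mean(similarity(a, b) for a, b in combinations(kept, 2))


# Toy example: scalar "features" and a similarity of 1 - |a - b|.
def toy_similarity(a, b):
    return 1.0 - abs(a - b)


toy_appearances = [
    (0, 0.9, 1.0),  # accurate appearance in shot 0
    (3, 0.2, 0.0),  # inaccurate appearance in shot 3, dropped by the gate
    (7, 0.8, 0.5),  # accurate appearance in shot 7
]

gated = gated_consistency(toy_appearances, toy_similarity)
ungated = gated_consistency(toy_appearances, toy_similarity, fidelity_threshold=0.0)
```

In this toy case the gate removes the shot-3 appearance, so the gated score reflects only the two appearances judged accurate, while the ungated score is also affected by the failed rendering.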