Testing how well AI understands motion and space in videos

Existing video-understanding benchmarks rely on static images or real footage, which makes it hard to isolate why models struggle with spatio-temporal reasoning. VGenST-Bench flips the approach: it uses generative models to synthesize controlled videos spanning different spatial scales, viewpoints, and motion patterns, then annotates them with human-verified QA pairs. The result is a fine-grained diagnostic tool that separates basic visual perception from actual reasoning about space and time, showing where today's multimodal language models genuinely fail.