Testing how well AI understands motion and space in videos
Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park
May 21, 2026
Existing video understanding benchmarks rely on static images or real footage, which makes it hard to isolate why models struggle with spatio-temporal reasoning. VGenST-Bench takes the opposite approach: it uses generative models to synthesize controlled videos spanning different spatial scales, viewpoints, and motion patterns, then pairs them with human-verified questions and answers. The result is a fine-grained diagnostic tool that separates basic visual perception from reasoning about space and time, showing where today's multimodal language models fail.