cs.CV

Testing how well AI understands motion and space in videos

Jinho Park, Youbin Kim, Hogun Park, Eunbyung Park

May 21, 2026

Existing video understanding benchmarks rely on static images or real footage, which makes it hard to isolate why models struggle with spatio-temporal reasoning. VGenST-Bench flips the approach: it uses generative models to synthesize controlled videos spanning different spatial scales, viewpoints, and motion patterns, then pairs each video with human-verified question-answer pairs. The result is a fine-grained diagnostic tool that separates basic visual perception from actual reasoning about space and time, showing where today's multimodal language models genuinely fail.
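To make the idea of a fine-grained diagnostic concrete, here is a minimal Python sketch of how per-question results could be broken down by controlled factor. It is not the benchmark's actual code or data format: the `QAResult` record, its field names, the category labels, and the toy results are all hypothetical and invented for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class QAResult:
    # All field names and label values are hypothetical, for illustration only.
    spatial_scale: str  # e.g. "tabletop", "room"
    viewpoint: str      # e.g. "egocentric", "third_person"
    motion: str         # e.g. "linear", "rotation"
    skill: str          # "perception" or "reasoning"
    correct: bool       # whether the model answered this question correctly


def accuracy_by(results, *keys):
    """Group results by the given attribute names and return accuracy per group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [num_correct, num_total]
    for r in results:
        group = tuple(getattr(r, k) for k in keys)
        totals[group][0] += int(r.correct)
        totals[group][1] += 1
    return {g: c / n for g, (c, n) in totals.items()}


# Toy data, invented for this sketch; these are not results from the paper.
results = [
    QAResult("room", "egocentric", "rotation", "perception", True),
    QAResult("room", "egocentric", "rotation", "perception", True),
    QAResult("room", "egocentric", "rotation", "reasoning", False),
    QAResult("room", "egocentric", "rotation", "reasoning", True),
]

# Compare perception and reasoning accuracy on the same kind of video.
by_skill = accuracy_by(results, "skill")
print(by_skill)
```

Because each video is generated under known settings, the same helper can group by any combination of factors, for example `accuracy_by(results, "viewpoint", "motion", "skill")`, to see whether a gap between perception and reasoning is tied to a particular viewpoint or motion pattern.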
Published as "VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis" (arXiv:2605.22570)