cs.CV

Can vision models teach video AI to reason better?

Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao, Pengfei Wan, Kun Gai, Jing Liao

June 1, 2026

Video reasoning requires a model to follow complex rules across a generated visual sequence, but current models fail at fine-grained instruction following. This work flips the usual paradigm: instead of using vision-language models (VLMs) as planners, it uses them as evaluators that give real-time feedback during video generation. A lightweight LoRA module is optimized at test time on rewards extracted from the VLM, pushing performance well beyond what either component achieves alone on symbolic and general reasoning benchmarks.
Published as "VLMs are Good Teachers for Video Reasoning via Adaptive Test-Time Optimization", arXiv:2606.02564