Can vision models teach video AI to reason better?

Junhao Cheng, Liang Hou, Tianxiong Zhong, Xin Tao, Pengfei Wan, Kun Gai, Jing Liao

Video reasoning requires a model to follow complex rules across a generated visual sequence, but current video models often fail at fine-grained instruction following. This work flips the usual paradigm: instead of using vision-language models (VLMs) as planners, it uses them as evaluators that give real-time feedback during video generation. A lightweight LoRA module is optimized at test time on rewards extracted from the VLM, pushing performance on symbolic and general reasoning benchmarks well beyond what either component achieves alone.