A benchmark for video models that know when to speak

Ruixiang Zhao, Jie Yang, Zijie Xin, Tianyi Wang, Fengyun Rao, Jing LYU, Xirong Li

OmniPro addresses a gap in evaluating omni-modal large language models on streaming video understanding: their ability to autonomously decide when and what to respond to in continuous audio-visual input. The benchmark comprises 2,700 human-verified samples covering nine sub-tasks at three cognitive levels, and 84% of the samples require audio signals (speech or non-speech). A dual-mode evaluation protocol uses Probe mode to assess content understanding and Online mode to assess fully proactive response capability. Experiments on 11 models show that audio consistently improves performance but remains underutilized, that robustness degrades over extended streams, and that non-speech audio perception remains a critical weakness. The code and benchmark are released.