Why video AI misses the crucial split-second moments

Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang, Yan Li, Xin Li, Haoyu Cao, Xing Sun, Shaofeng Zhang, Xu Yang, Zhihang Zhong, Xue Yang

Video language models ace long-form understanding but stumble on split-second events: a quick action, a state change, a single-frame detail. Researchers created Moment-Video, a 1,000-question benchmark that exposes this blind spot across 33 models. Even top performers reach only 39.6% accuracy when forced to spot, count, or reason about transient visual evidence. The problem: sparse frame sampling and temporal compression skip or blur the evidence before it ever reaches the language model, so no amount of language reasoning can recover it.
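To make the sampling problem concrete, here is a minimal sketch of how uniform sparse sampling can step right over a short event. It is an illustration under stated assumptions, not the benchmark's or any model's actual pipeline: the function names, the 60-second clip, the 16-frame and 240-frame budgets, and the half-second event window are all hypothetical.

```python
def uniform_sample_times(duration_s: float, num_frames: int) -> list[float]:
    """Return timestamps at the centers of num_frames equal-length segments.

    This is one common way to sample frames uniformly; real models may
    differ (assumption for illustration).
    """
    step = duration_s / num_frames
    return [(i + 0.5) * step for i in range(num_frames)]


def frames_hitting_event(times: list[float], start_s: float, end_s: float) -> list[float]:
    """Return the sampled timestamps that fall inside the event window."""
    return [t for t in times if start_s <= t <= end_s]


# Hypothetical clip: 60 seconds long, with a transient event from 20.0 s to 20.5 s.
DURATION_S = 60.0
EVENT_START_S, EVENT_END_S = 20.0, 20.5

# Sparse budget: 16 frames, i.e. one frame every 3.75 s.
sparse_times = uniform_sample_times(DURATION_S, 16)
sparse_hits = frames_hitting_event(sparse_times, EVENT_START_S, EVENT_END_S)

# Denser budget: 240 frames, i.e. one frame every 0.25 s.
dense_times = uniform_sample_times(DURATION_S, 240)
dense_hits = frames_hitting_event(dense_times, EVENT_START_S, EVENT_END_S)

print(f"sparse: {len(sparse_hits)} frame(s) inside the event window")
print(f"dense:  {len(dense_hits)} frame(s) inside the event window")
```

With the sparse budget, the nearest samples land at 16.875 s and 20.625 s, so no frame shows the event at all. The denser budget catches it twice, but at 15 times the frame count, which is the trade-off that makes transient evidence easy to lose.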