Do AI assistants actually listen? Testing real-time multimodal models on live streams

Xudong Lu, Xueying Li, Annan Wang, Yang Bo, Jinpeng Chen, Zengliang Li, Nianzu Yang, Rui Liu, Xue Yang, Jingwen Hou, Hongsheng Li

OmniInteract tests whether large language models can handle real-time audio-visual streams the way humans do: responding to questions, detecting when to speak, and handling interruptions, all without seeing future content. The benchmark includes 250 videos with 1,430 interaction scenarios; some require an instant reply, others a proactive decision. Every evaluated model performed poorly (the best reached an F1 of 0.368), and, surprisingly, strong offline video-understanding ability did not translate to online streaming. The code and data are being released publicly.
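To make the "no future content" constraint and the F1 figure concrete, here is a minimal sketch of how a speak-or-stay-silent decision could be scored on a stream. It is an illustration under stated assumptions, not the benchmark's actual protocol: the chunked string stream, the per-chunk boolean labels, and the names `run_causal`, `score_decisions`, and `speak_on_event` are all hypothetical.

```python
from typing import Callable, List, Sequence


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def run_causal(
    stream: Sequence[str],
    should_speak: Callable[[Sequence[str]], bool],
) -> List[bool]:
    """Query the policy once per chunk, passing only the prefix seen so far."""
    return [should_speak(stream[: t + 1]) for t in range(len(stream))]


def score_decisions(predicted: Sequence[bool], reference: Sequence[bool]) -> float:
    """F1 over per-chunk speak decisions (assumed scoring, for illustration only)."""
    tp = sum(p and r for p, r in zip(predicted, reference))
    fp = sum(p and not r for p, r in zip(predicted, reference))
    fn = sum((not p) and r for p, r in zip(predicted, reference))
    return f1_score(tp, fp, fn)


# Hypothetical toy stream: each chunk is a label standing in for audio-visual input.
stream = ["silence", "question", "silence", "alarm"]
reference = [False, True, False, True]  # chunks where a reply is expected


def speak_on_event(prefix: Sequence[str]) -> bool:
    """Toy policy: speak whenever the newest chunk is not silence."""
    return prefix[-1] != "silence"


def always_speak(prefix: Sequence[str]) -> bool:
    """Toy baseline: speak on every chunk."""
    return True


good = score_decisions(run_causal(stream, speak_on_event), reference)
noisy = score_decisions(run_causal(stream, always_speak), reference)
```

In this toy setup the always-speaking baseline reaches full recall but only half precision, so its F1 falls well below that of the event-driven policy. That shows why a score of this kind penalizes a model that talks when it should stay silent.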