cs.CV

Do AI assistants actually listen? Testing real-time multimodal models on live streams

Xudong Lu, Xueying Li, Annan Wang, Yang Bo, Jinpeng Chen, Zengliang Li, Nianzu Yang, Rui Liu, Xue Yang, Jingwen Hou, Hongsheng Li

May 26, 2026

OmniInteract tests whether large language models can handle real-time audio-visual streams the way humans do: answering questions, deciding when to speak, and handling interruptions, all without access to future content. The benchmark comprises 250 videos with 1,430 interaction scenarios, some requiring an instant reply and others a proactive decision. Every evaluated model performed poorly (best F1: 0.368), and, surprisingly, strong offline video-understanding ability did not carry over to online streaming. The code and data are being released publicly.
Published as "OmniInteract: Benchmarking Real-World Streaming Interaction for Real-Time Omnimodal Assistants" (arXiv:2605.26485)