cs.CL

Teaching AI to reason with raw audio and video, not just text

Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li, Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu, Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei, Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang

May 21, 2026

Current multimodal AI models force audio and video into text descriptions before reasoning about them, losing temporal detail and relying too heavily on language patterns. LatentOmni keeps audio and visual information in a shared continuous space while still performing text-based reasoning, using a new position embedding method to keep audio and video synchronized. The authors release a 35K-example dataset of audio-visual reasoning tasks and show consistent improvements over text-only reasoning baselines across multiple benchmarks.
Published as "LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning", arXiv:2605.22012.