Teaching AI to reason with raw audio and video, not just text

Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li, Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu, Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei, Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang

Current multimodal AI models convert audio and video into text descriptions before reasoning about them, which discards temporal detail and makes the reasoning lean too heavily on language patterns. LatentOmni instead keeps audio and visual information in a shared continuous space while still performing text-based reasoning, and uses a new position embedding method to keep the audio and video streams synchronized. The authors release a 35K-example dataset of audio-visual reasoning tasks and report consistent improvements over text-only reasoning baselines across multiple benchmarks.
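To make the synchronization idea concrete, here is a minimal sketch of one generic way to give audio and video tokens a shared temporal index. It is not LatentOmni's position embedding method, which this summary does not specify. The function names, the millisecond timestamps, and the fixed step size are all assumptions made for illustration.

```python
from typing import List, Tuple


def temporal_position_ids(timestamps_ms: List[int], step_ms: int) -> List[int]:
    """Map each token's timestamp (in ms) to an integer temporal index.

    Hypothetical helper: tokens from any modality that fall in the same
    step_ms-wide window receive the same index.
    """
    if step_ms <= 0:
        raise ValueError("step_ms must be positive")
    return [t // step_ms for t in timestamps_ms]


def interleave_by_time(
    video_ms: List[int], audio_ms: List[int], step_ms: int
) -> List[Tuple[str, int]]:
    """Merge video and audio tokens into one sequence ordered by temporal index.

    Assumption for illustration: at equal indices, video comes before audio.
    Python's sorted() is stable, so insertion order decides ties.
    """
    tagged = [("video", p) for p in temporal_position_ids(video_ms, step_ms)]
    tagged += [("audio", p) for p in temporal_position_ids(audio_ms, step_ms)]
    return sorted(tagged, key=lambda item: item[1])


if __name__ == "__main__":
    # Two video frames (one per second) and four audio chunks (one per 500 ms).
    for modality, position in interleave_by_time(
        [0, 1000], [0, 500, 1000, 1500], step_ms=500
    ):
        print(modality, position)
```

In this sketch the video frame at 1000 ms and the audio chunk at 1000 ms both receive index 2, so a model consuming these indices would see them as simultaneous even though the two streams are sampled at different rates.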