Why audio-video-text search needs to learn from all three at once
Yunze Liu, Chi-Hao Wu, Enmin Zhou, Junxiao Shen
May 26, 2026
Most multimodal search systems learn audio, video, and text separately. OmniRetriever instead trains a unified encoder that treats the fused triple-modal embedding as a teacher, forcing each individual modality to align with the complete signal. On six zero-shot benchmarks, it outperforms Google's Gemini Embedding 2 by 13–18% for audio searches. The work also introduces OmniRetriever-Bench, a 3,782-sample retrieval benchmark covering all 12 direction combinations. Code and models are released.
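To make the "fused embedding as teacher" idea concrete, here is a minimal sketch in plain Python. It is not the paper's implementation: the fusion rule (averaging unit vectors), the InfoNCE-style loss, the temperature of 0.07, and every function name are assumptions chosen for illustration. In a real training loop the teacher embedding would typically also be excluded from gradient updates, which this gradient-free sketch does not model.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def fuse(audio, video, text):
    """Hypothetical fusion: average the three unit vectors, then renormalize."""
    a, v, t = normalize(audio), normalize(video), normalize(text)
    return normalize([(x + y + z) / 3.0 for x, y, z in zip(a, v, t)])

def info_nce(students, teachers, temperature=0.07):
    """In-batch contrastive loss: student i should match teacher i, not teacher j."""
    students = [normalize(s) for s in students]
    teachers = [normalize(t) for t in teachers]
    total = 0.0
    for i, s in enumerate(students):
        logits = [sum(x * y for x, y in zip(s, t)) / temperature for t in teachers]
        peak = max(logits)  # subtract the max before exponentiating, for stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(students)

def fused_teacher_loss(audio, video, text, temperature=0.07):
    """Average loss pulling each single-modality embedding toward the fused teacher."""
    teachers = [fuse(a, v, t) for a, v, t in zip(audio, video, text)]
    per_modality = [info_nce(m, teachers, temperature) for m in (audio, video, text)]
    return sum(per_modality) / len(per_modality)
```

Each argument is a batch of embeddings, one list of floats per sample, with the three batches in the same sample order. The loss is near zero when all three modalities of a sample already agree, and it grows when they point in different directions, which is the pressure that pulls each modality toward the combined signal.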