Why audio-video-text search needs to learn from all three at once

Most multimodal search systems learn audio, video, and text separately. OmniRetriever instead trains a unified encoder by treating the fused triple-modal embedding as a teacher, forcing each individual modality to align with the complete signal. On six zero-shot benchmarks, it outperforms Google's Gemini Embedding 2 by 13–18% for audio searches. The work also introduces OmniRetriever-Bench, a 3,782-sample retrieval benchmark covering all 12 direction combinations. Code and models are released.
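To make the "fused embedding as teacher" idea concrete, here is a minimal sketch in plain Python. It is not OmniRetriever's actual objective: the fusion rule (a normalized mean of the three embeddings), the cosine-distance loss, and the names `fuse` and `alignment_loss` are all assumptions chosen for illustration. It only shows the shape of the idea: build one target from all three modalities, then measure how far each single-modality embedding sits from it.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    u, v = l2_normalize(u), l2_normalize(v)
    return sum(a * b for a, b in zip(u, v))


def fuse(audio, video, text):
    # Assumption: the fused "teacher" is the normalized mean of the three
    # embeddings. The real system may use a learned fusion module instead.
    mean = [(a + v + t) / 3.0 for a, v, t in zip(audio, video, text)]
    return l2_normalize(mean)


def alignment_loss(audio, video, text):
    """Average cosine distance from each modality to the fused teacher.

    Assumption: in a real training loop the teacher would be held fixed
    (no gradient), so only the single-modality embeddings move toward it.
    """
    teacher = fuse(audio, video, text)
    students = {"audio": audio, "video": video, "text": text}
    per_modality = {
        name: 1.0 - cosine(emb, teacher) for name, emb in students.items()
    }
    total = sum(per_modality.values()) / len(per_modality)
    return total, per_modality


if __name__ == "__main__":
    # Three mutually orthogonal embeddings: each modality sees a different
    # "slice" of the signal, so every one is equally far from the teacher.
    loss, parts = alignment_loss([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])
    print(loss, parts)
```

With the three orthogonal embeddings in the example, each modality has cosine similarity 1/√3 with the fused target, so the loss is 1 − 1/√3 (about 0.423). When all three embeddings already agree, the loss falls to zero, which is the state the alignment is pushing toward.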