cs.CV

Why audio-video-text search needs to learn from all three at once

Yunze Liu, Chi-Hao Wu, Enmin Zhou, Junxiao Shen

May 26, 2026

Most multimodal search systems learn audio, video, and text separately. OmniRetriever instead trains a unified encoder that treats the fused triple-modal embedding as a teacher, forcing each individual modality to align with the complete signal. On six zero-shot benchmarks, it outperforms Google's Gemini Embedding 2 by 13–18% for audio searches. The paper also introduces OmniRetriever-Bench, a 3,782-sample retrieval benchmark covering all 12 direction combinations. Code and models are released.
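To make the fusion-as-teacher idea concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the mean-pooling fusion, the cosine-distance loss, and the names `fuse` and `fusion_as_teacher_loss` are all assumptions chosen for illustration, and the paper's actual fusion module and objective may differ. It only shows the general shape of the technique, in which the fused embedding is held fixed as a target and each single-modality embedding is scored by how far it sits from that target.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length; reject zero vectors."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def fuse(audio, video, text):
    """Hypothetical fusion: mean of the unit-normalized modality embeddings.

    A real system would likely use a learned fusion module; this is an
    assumption made only to keep the sketch self-contained.
    """
    parts = [l2_normalize(v) for v in (audio, video, text)]
    mean = [sum(col) / len(parts) for col in zip(*parts)]
    return l2_normalize(mean)


def fusion_as_teacher_loss(audio, video, text):
    """Average cosine distance from each modality to the fused teacher.

    In gradient-based training the teacher would be treated as a constant
    (no gradient flows through it); here it is simply computed once.
    """
    teacher = fuse(audio, video, text)
    students = {"audio": audio, "video": video, "text": text}
    per_modality = {
        name: 1.0 - cosine(emb, teacher) for name, emb in students.items()
    }
    total = sum(per_modality.values()) / len(per_modality)
    return total, per_modality


# Toy example with made-up 3-dimensional embeddings for one sample.
loss, parts = fusion_as_teacher_loss([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])
```

Under these assumptions the loss is zero when all three modality embeddings already point in the same direction, and it grows as any single modality drifts away from the fused representation.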
Published as "OmniRetriever: Any-to-Any Audio-Video-Text Retrieval via Fusion-as-Teacher Distillation", arXiv:2605.26641