cs.CV

How should you combine multiple senses to recognize things in new environments?

Qianshan Zhan, Qian Wang, Da Li, Xiao-Jun Zeng, Xiatian Zhu

May 30, 2026

Models often degrade when they are trained in one environment and tested in another, and the problem gets harder when they use multiple data types (video, audio, depth, thermal). This work introduces MMDG-Bench, the first systematic evaluation of multimodal domain generalization. It compares two orderings of fusion and adaptation: fusing modalities first and then adapting (M2D) versus adapting first and then fusing (D2M). Across face anti-spoofing and action recognition, the choice of framework matters: D2M is favored when cross-domain relationships are stable, while M2D handles modal drift better. The code has been released.
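To make the difference between the two orderings concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: `fuse` (an element-wise average across modalities) and `adapt` (per-feature standardization over the batch) are hypothetical stand-ins chosen only so that the order of operations visibly changes the result. The function names, the toy feature values, and the choice of operations are all assumptions.

```python
from statistics import mean, pstdev

def adapt(batch):
    """Hypothetical stand-in for a domain-generalization step:
    standardize each feature column across the batch."""
    cols = list(zip(*batch))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against zero spread
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in batch]

def fuse(batches):
    """Hypothetical stand-in for fusion: element-wise average of the
    per-modality feature vectors belonging to the same sample."""
    return [[mean(vals) for vals in zip(*rows)] for rows in zip(*batches)]

def m2d(batches):
    """Fuse the modalities first, then adapt the fused representation."""
    return adapt(fuse(batches))

def d2m(batches):
    """Adapt within every modality first, then fuse the results."""
    return fuse([adapt(b) for b in batches])

# Toy features: three samples, one feature per modality (made-up values).
video = [[1.0], [2.0], [9.0]]
audio = [[100.0], [0.0], [50.0]]

m2d_out = m2d([video, audio])
d2m_out = d2m([video, audio])
print("M2D:", m2d_out)
print("D2M:", d2m_out)
```

In this toy setup the two orderings give different outputs: under M2D the larger-scale audio feature dominates the fused value before any adaptation happens, whereas under D2M each modality is rescaled before fusion. This only illustrates why the ordering can matter; it says nothing about which one performs better on real benchmarks.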
Published as "MMDG-Bench: A Benchmark for Multimodal Domain Generalization", arXiv:2606.00891