How should you combine multiple senses to recognize things in new environments?

Models often degrade when trained in one environment and tested in another, and the problem gets harder when they use multiple data types (video, audio, depth, thermal). This work introduces MMDG-Bench, the first systematic evaluation of multimodal domain generalization. It compares two orderings of fusion and adaptation: fusing the modalities first and then adapting across domains (M2D), versus adapting each modality first and then fusing (D2M). Across face anti-spoofing and action recognition, the choice of ordering matters: D2M is favored when cross-domain relationships are stable, while M2D handles modal drift better. Code is released.
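
To make the two orderings concrete, here is a toy sketch in plain Python. It is not the released MMDG-Bench code. Everything in it is an assumption for illustration: the "adaptation" step is stood in for by per-feature standardization, fusion is a simple average of two same-sized feature vectors, and the names `standardize`, `fuse`, `m2d`, and `d2m` are hypothetical. The only point is that the two steps do not commute, so the order changes the features a classifier would see.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Stand-in for 'adaptation' (an assumption): z-score each feature column."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]


def fuse(a_rows, b_rows):
    """Stand-in for fusion (an assumption): average two modalities per feature."""
    return [[(x + y) / 2 for x, y in zip(a, b)] for a, b in zip(a_rows, b_rows)]


def m2d(video, audio):
    """M2D ordering: fuse the modalities first, then adapt the fused features."""
    return standardize(fuse(video, audio))


def d2m(video, audio):
    """D2M ordering: adapt each modality separately, then fuse."""
    return fuse(standardize(video), standardize(audio))


# Made-up one-dimensional features for three samples from a single domain.
video = [[1.0], [3.0], [5.0]]
audio = [[30.0], [10.0], [20.0]]

print("M2D:", m2d(video, audio))
print("D2M:", d2m(video, audio))
```

In this toy setup the raw audio values are much larger than the video values, so under M2D they dominate the fused feature before any normalization happens. Under D2M each modality is rescaled first and both contribute equally. Real systems would use learned encoders and an actual domain generalization method in place of these stand-ins.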